

Intel® 64 and IA-32 Architectures Software Developer’s Manual

Volume 2B: Instruction Set Reference, M-U


NOTE: The Intel® 64 and IA-32 Architectures Software Developer's Manual consists of ten volumes: Basic Architecture, Order Number 253665; Instruction Set Reference A-L, Order Number 253666; Instruction Set Reference M-U, Order Number 253667; Instruction Set Reference V-Z, Order Number 326018; Instruction Set Reference, Order Number 334569; System Programming Guide, Part 1, Order Number 253668; System Programming Guide, Part 2, Order Number 253669; System Programming Guide, Part 3, Order Number 326019; System Programming Guide, Part 4, Order Number 332831; Model-Specific Registers, Order Number 335592. Refer to all ten volumes when evaluating your design needs.


Order Number: 253667-067US

May 2018



Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Learn more at intel.com, or from the OEM or retailer.

No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems or any damages resulting from such losses.

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting http://www.intel.com/design/literature.htm.

Intel, the Intel logo, Intel Atom, Intel Core, Intel SpeedStep, MMX, Pentium, VTune, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others. Copyright © 1997-2018, Intel Corporation. All Rights Reserved.

CHAPTER 4 INSTRUCTION SET REFERENCE, M-U



    1. IMM8 CONTROL BYTE OPERATION FOR PCMPESTRI / PCMPESTRM / PCMPISTRI / PCMPISTRM

      The notations introduced in this section are referenced in the reference pages of PCMPESTRI, PCMPESTRM, PCMPISTRI, and PCMPISTRM. The operation of the immediate control byte is common to these four string and text processing instructions of SSE4.2. This section describes the common operations.


      1. General Description

        The operation of PCMPESTRI, PCMPESTRM, PCMPISTRI, PCMPISTRM is defined by the combination of the respective opcode and the interpretation of an immediate control byte that is part of the instruction encoding.

        The opcode controls the relationship of input bytes/words to each other (it determines whether the inputs are terminated strings or whether lengths are expressed explicitly) as well as the desired output (index or mask).

        The Imm8 Control Byte for PCMPESTRM/PCMPESTRI/PCMPISTRM/PCMPISTRI encodes a significant amount of programmable control over the functionality of those instructions. Some functionality is unique to each instruction while some is common across some or all of the four instructions. This section describes functionality which is common across the four instructions.

        The arithmetic flags (ZF, CF, SF, OF, AF, PF) are set as a result of these instructions. However, the meanings of the flags have been overloaded from their typical meanings in order to provide additional information regarding the relationships of the two inputs.

        PCMPxSTRx instructions perform arithmetic comparisons between all possible pairs of bytes or words, one from each packed input source operand. The boolean results of those comparisons are then aggregated in order to produce meaningful results. The Imm8 Control Byte is used to affect the interpretation of individual input elements as well as control the arithmetic comparisons used and the specific aggregation scheme.

        Specifically, the Imm8 Control Byte consists of bit fields that control the following attributes:

        • Source data format — Byte/word data element granularity, signed or unsigned elements

        • Aggregation operation — Encodes the mode of per-element comparison operation and the aggregation of per-element comparisons into an intermediate result

        • Polarity — Specifies intermediate processing to be performed on the intermediate result

        • Output selection — Specifies final operation to produce the output (depending on index or mask) from the intermediate result


      2. Source Data Format


        Table 4-1. Source Data Format

        Imm8[1:0] | Meaning | Description
        00b | Unsigned bytes | Both 128-bit sources are treated as packed, unsigned bytes.
        01b | Unsigned words | Both 128-bit sources are treated as packed, unsigned words.
        10b | Signed bytes | Both 128-bit sources are treated as packed, signed bytes.
        11b | Signed words | Both 128-bit sources are treated as packed, signed words.


        If the Imm8 Control Byte has bit[0] cleared, each source contains 16 packed bytes. If the bit is set, each source contains 8 packed words. If the Imm8 Control Byte has bit[1] cleared, each input contains unsigned data. If the bit is set, each source contains signed data.
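The two format bits can be decoded with a pair of helper functions. The following is an illustrative sketch (the helper names are not Intel's, and this is not how the hardware is implemented):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: decode the source-data-format field, Imm8[1:0]. */
static int element_count(uint8_t imm8) {
    return (imm8 & 1) ? 8 : 16;      /* bit 0 set: 8 words; clear: 16 bytes */
}
static int elements_are_signed(uint8_t imm8) {
    return (imm8 >> 1) & 1;          /* bit 1 set: signed elements */
}
```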


      3. Aggregation Operation


        Table 4-2. Aggregation Operation

        Imm8[3:2] | Mode | Comparison
        00b | Equal any | The arithmetic comparison is “equal.”
        01b | Ranges | Arithmetic comparison is “greater than or equal” between even-indexed bytes/words of reg and each byte/word of reg/mem, and “less than or equal” between odd-indexed bytes/words of reg and each byte/word of reg/mem. (reg/mem[m] >= reg[n] for n = even, reg/mem[m] <= reg[n] for n = odd)
        10b | Equal each | The arithmetic comparison is “equal.”
        11b | Equal ordered | The arithmetic comparison is “equal.”


        All 256 (64) possible comparisons are always performed. The individual Boolean results of those comparisons are referred to as “BoolRes[Reg/Mem element index, Reg element index].” Comparisons evaluating to “True” are represented with a 1, “False” with a 0 (positive logic). The initial results are then aggregated into a 16-bit (8-bit) intermediate result (IntRes1) using one of the modes described in the table below, as determined by Imm8 Control Byte bits [3:2].



        See Section 4.1.6 for a description of the overrideIfDataInvalid() function used in Table 4-3.

        Table 4-3. Aggregation Operation

        Equal any (find characters from a set):
            UpperBound = imm8[0] ? 7 : 15;
            IntRes1 = 0;
            For j = 0 to UpperBound, j++
                For i = 0 to UpperBound, i++
                    IntRes1[j] OR= overrideIfDataInvalid(BoolRes[j,i])

        Ranges (find characters from ranges):
            UpperBound = imm8[0] ? 7 : 15;
            IntRes1 = 0;
            For j = 0 to UpperBound, j++
                For i = 0 to UpperBound, i+=2
                    IntRes1[j] OR= (overrideIfDataInvalid(BoolRes[j,i]) AND overrideIfDataInvalid(BoolRes[j,i+1]))

        Equal each (string compare):
            UpperBound = imm8[0] ? 7 : 15;
            IntRes1 = 0;
            For i = 0 to UpperBound, i++
                IntRes1[i] = overrideIfDataInvalid(BoolRes[i,i])

        Equal ordered (substring search):
            UpperBound = imm8[0] ? 7 : 15;
            IntRes1 = imm8[0] ? FFH : FFFFH;
            For j = 0 to UpperBound, j++
                For i = 0 to UpperBound-j, k = j to UpperBound, k++, i++
                    IntRes1[j] AND= overrideIfDataInvalid(BoolRes[k,i])
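As an illustration of how the “equal ordered” aggregation behaves, the C model below implements the substring search for 16 unsigned bytes with explicit element counts (as in the PCMPESTRx forms), folding the validity overrides of Table 4-7 directly into the inner loop. This is a sketch for exposition, not Intel's implementation; the function and parameter names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Model of "equal ordered" (imm8[3:2] = 11b) for 16 unsigned bytes.
   reg holds the needle (lreg valid bytes), mem the haystack (lmem valid
   bytes).  Bit j of the returned IntRes1 is set when the needle matches
   the haystack starting at position j. */
static uint16_t equal_ordered(const uint8_t reg[16], int lreg,
                              const uint8_t mem[16], int lmem) {
    uint16_t intres1 = 0xFFFF;                 /* initial value per Table 4-3 */
    for (int j = 0; j <= 15; j++) {
        for (int i = 0, k = j; k <= 15; k++, i++) {
            int reg_valid = i < lreg, mem_valid = k < lmem;
            int boolres;
            if (!reg_valid)      boolres = 1;  /* needle exhausted: force true  */
            else if (!mem_valid) boolres = 0;  /* haystack exhausted: force false */
            else                 boolres = (mem[k] == reg[i]);
            if (!boolres) { intres1 &= (uint16_t)~(1u << j); break; }
        }
    }
    return intres1;
}
```

For a two-byte needle "ab" in "abcabf", bits 0 and 3 of the result are set, matching the two occurrences of the substring.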


      4. Polarity

        IntRes1 may then be further modified by performing a 1’s complement, according to the value of Imm8 Control Byte bit[4]. Optionally, a mask may be used such that only those IntRes1 bits which correspond to “valid” reg/mem input elements are complemented (note that the definition of a valid input element is dependent on the specific opcode and is defined in each opcode’s description). The result of the possible negation is referred to as IntRes2.
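The polarity step can be sketched as a small C function over a 16-bit IntRes1; this is illustrative only (the names are not Intel's), and it assumes the 16-element byte case:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the polarity step (Table 4-4).  valid_mask has bit i set
   when reg/mem element i is valid. */
static uint16_t apply_polarity(uint16_t intres1, uint8_t imm8, uint16_t valid_mask) {
    switch ((imm8 >> 4) & 3) {              /* Imm8[5:4] */
    case 0:  return intres1;                /* 00b: positive polarity   */
    case 1:  return (uint16_t)~intres1;     /* 01b: -1 XOR IntRes1      */
    case 2:  return intres1;                /* 10b: masked (+)          */
    default: return intres1 ^ valid_mask;   /* 11b: negate valid bits only */
    }
}
```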


        Table 4-4. Polarity

        Imm8[5:4] | Operation | Description
        00b | Positive Polarity (+) | IntRes2 = IntRes1
        01b | Negative Polarity (-) | IntRes2 = -1 XOR IntRes1
        10b | Masked (+) | IntRes2 = IntRes1
        11b | Masked (-) | IntRes2[i] = IntRes1[i] if reg/mem[i] invalid, else = ~IntRes1[i]


      5. Output Selection


        Table 4-5. Output Selection

        Imm8[6] | Operation | Description
        0b | Least significant index | The index returned to ECX is of the least significant set bit in IntRes2.
        1b | Most significant index | The index returned to ECX is of the most significant set bit in IntRes2.


        For PCMPESTRI/PCMPISTRI, the Imm8 Control Byte bit[6] is used to determine if the index is of the least significant or most significant bit of IntRes2.


        Table 4-6. Output Selection

        Imm8[6] | Operation | Description
        0b | Bit mask | IntRes2 is returned as the mask to the least significant bits of XMM0 with zero extension to 128 bits.
        1b | Byte/word mask | IntRes2 is expanded into a byte/word mask (based on imm8[1]) and placed in XMM0. The expansion is performed by replicating each bit into all of the bits of the byte/word of the same index.
        Specifically for PCMPESTRM/PCMPISTRM, the Imm8 Control Byte bit[6] is used to determine if the mask is a 16 (8) bit mask or a 128 bit byte/word mask.
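The byte-mask expansion of Table 4-6 (imm8[6] = 1, byte granularity) can be modeled as follows; the 128-bit XMM0 result is represented here as a 16-byte array, and the function name is illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of byte-mask expansion: each of the 16 bits of IntRes2 is
   replicated into every bit of the result byte with the same index. */
static void expand_byte_mask(uint16_t intres2, uint8_t out[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = ((intres2 >> i) & 1) ? 0xFF : 0x00;
}
```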


      6. Valid/Invalid Override of Comparisons

        PCMPxSTRx instructions allow for the possibility that an end-of-string (EOS) situation may occur within the 128-bit packed data value (see the instruction descriptions below for details). Any data elements on either source that are determined to be past the EOS are considered to be invalid, and the treatment of invalid data within a comparison pair varies depending on the aggregation function being performed.

        In general, the individual comparison result for each element pair BoolRes[i,j] can be forced true or false if one or more elements in the pair are invalid. See Table 4-7.


        Table 4-7. Comparison Result for Each Element Pair BoolRes[i,j]

        xmm1 byte/word | xmm2/m128 byte/word | Imm8[3:2] = 00b (equal any) | Imm8[3:2] = 01b (ranges) | Imm8[3:2] = 10b (equal each) | Imm8[3:2] = 11b (equal ordered)
        Invalid | Invalid | Force false | Force false | Force true | Force true
        Invalid | Valid | Force false | Force false | Force false | Force true
        Valid | Invalid | Force false | Force false | Force false | Force false
        Valid | Valid | Do not force | Do not force | Do not force | Do not force


      7. Summary of Imm8 Control Byte

        Table 4-8. Summary of Imm8 Control Byte

        Imm8 | Description
        -------0b | 128-bit sources treated as 16 packed bytes.
        -------1b | 128-bit sources treated as 8 packed words.
        ------0-b | Packed bytes/words are unsigned.
        ------1-b | Packed bytes/words are signed.
        ----00--b | Mode is equal any.
        ----01--b | Mode is ranges.
        ----10--b | Mode is equal each.
        ----11--b | Mode is equal ordered.
        ---0----b | IntRes1 is unmodified.
        ---1----b | IntRes1 is negated (1’s complement).
        --0-----b | Negation of IntRes1 is for all 16 (8) bits.
        --1-----b | Negation of IntRes1 is masked by reg/mem validity.
        -0------b | Index of the least significant, set, bit is used (regardless of corresponding input element validity). IntRes2 is returned in least significant bits of XMM0.
        -1------b | Index of the most significant, set, bit is used (regardless of corresponding input element validity). Each bit of IntRes2 is expanded to byte/word.
        0-------b | This bit currently has no defined effect, should be 0.
        1-------b | This bit currently has no defined effect, should be 0.
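As a worked example of Table 4-8, the control byte can be pulled apart field by field. The struct and function names below are illustrative, not part of any Intel API; the example value 0x0C selects equal-ordered mode on 16 unsigned bytes with positive polarity:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: decode a complete Imm8 Control Byte per Table 4-8. */
struct imm8_fields {
    int words;         /* bit 0: 1 = 8 words, 0 = 16 bytes       */
    int signed_data;   /* bit 1: 1 = signed elements             */
    int mode;          /* bits 3:2: 0 any, 1 ranges, 2 each, 3 ordered */
    int negate;        /* bit 4: 1's complement of IntRes1       */
    int mask_negate;   /* bit 5: negation masked by validity     */
    int msb_or_expand; /* bit 6: MS index / byte-word mask       */
};
static struct imm8_fields decode_imm8(uint8_t imm8) {
    struct imm8_fields f = {
        imm8 & 1, (imm8 >> 1) & 1, (imm8 >> 2) & 3,
        (imm8 >> 4) & 1, (imm8 >> 5) & 1, (imm8 >> 6) & 1
    };
    return f;
}
```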


      8. Diagram Comparison and Aggregation Process





        Figure 4-1. Operation of PCMPSTRx and PCMPESTRx


    2. COMMON TRANSFORMATION AND PRIMITIVE FUNCTIONS FOR SHA1XXX AND SHA256XXX

      The following primitive functions and transformations are used in the algorithmic descriptions of the SHA1 and SHA256 instruction extensions SHA1NEXTE, SHA1RNDS4, SHA1MSG1, SHA1MSG2, SHA256RNDS2, SHA256MSG1, and SHA256MSG2. The operands of these primitives and transformations are generally 32-bit DWORD integers.

      • f0(): A bit oriented logical operation that derives a new dword from three SHA1 state variables (dword). This function is used in SHA1 round 1 to 20 processing.
        f0(B,C,D) ← (B AND C) XOR ((NOT B) AND D)


      • f1(): A bit oriented logical operation that derives a new dword from three SHA1 state variables (dword). This function is used in SHA1 round 21 to 40 processing.

        f1(B,C,D) ← B XOR C XOR D


      • f2(): A bit oriented logical operation that derives a new dword from three SHA1 state variables (dword). This function is used in SHA1 round 41 to 60 processing.

        f2(B,C,D) ← (B AND C) XOR (B AND D) XOR (C AND D)


      • f3(): A bit oriented logical operation that derives a new dword from three SHA1 state variables (dword). This function is used in SHA1 round 61 to 80 processing. It is the same as f1().

        f3(B,C,D) ← B XOR C XOR D


      • Ch(): A bit oriented logical operation that derives a new dword from three SHA256 state variables (dword).
        Ch(E,F,G) ← (E AND F) XOR ((NOT E) AND G)


      • Maj(): A bit oriented logical operation that derives a new dword from three SHA256 state variables (dword).
        Maj(A,B,C) ← (A AND B) XOR (A AND C) XOR (B AND C)
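The round functions above translate directly into C bitwise expressions. This sketch is for reference only (it is not Intel's code, and f3() is omitted since it is identical to f1()):

```c
#include <assert.h>
#include <stdint.h>

/* C renderings of the SHA1 round functions f0..f2 and the SHA256
   Ch()/Maj() primitives defined in the text. */
static uint32_t f0 (uint32_t b, uint32_t c, uint32_t d) { return (b & c) ^ (~b & d); }
static uint32_t f1 (uint32_t b, uint32_t c, uint32_t d) { return b ^ c ^ d; }
static uint32_t f2 (uint32_t b, uint32_t c, uint32_t d) { return (b & c) ^ (b & d) ^ (c & d); }
static uint32_t Ch (uint32_t e, uint32_t f, uint32_t g) { return (e & f) ^ (~e & g); }
static uint32_t Maj(uint32_t a, uint32_t b, uint32_t c) { return (a & b) ^ (a & c) ^ (b & c); }
```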


      • ROR is the rotate right operation.
        (A ROR N) ← A[N-1:0] || A[Width-1:N]

      • ROL is the rotate left operation.
        (A ROL N) ← A ROR (Width-N)

      • SHR is the right shift operation.
        (A SHR N) ← ZEROES[N-1:0] || A[Width-1:N]


      • Σ0(): A bit oriented logical and rotational transformation performed on a dword SHA256 state variable.
        Σ0(A) ← (A ROR 2) XOR (A ROR 13) XOR (A ROR 22)

      • Σ1(): A bit oriented logical and rotational transformation performed on a dword SHA256 state variable.
        Σ1(E) ← (E ROR 6) XOR (E ROR 11) XOR (E ROR 25)

      • σ0(): A bit oriented logical and rotational transformation performed on a SHA256 message dword used in the message scheduling.
        σ0(W) ← (W ROR 7) XOR (W ROR 18) XOR (W SHR 3)

      • σ1(): A bit oriented logical and rotational transformation performed on a SHA256 message dword used in the message scheduling.
        σ1(W) ← (W ROR 17) XOR (W ROR 19) XOR (W SHR 10)
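The four sigma transformations reduce to rotates and shifts over a 32-bit word. A reference sketch (the ASCII names Sigma0/Sigma1/sigma0/sigma1 follow FIPS 180-4 conventions; this is not Intel's code):

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit rotate right; n must be in 1..31 here. */
static uint32_t ror32(uint32_t a, unsigned n) { return (a >> n) | (a << (32 - n)); }

static uint32_t Sigma0(uint32_t a) { return ror32(a, 2)  ^ ror32(a, 13) ^ ror32(a, 22); }
static uint32_t Sigma1(uint32_t e) { return ror32(e, 6)  ^ ror32(e, 11) ^ ror32(e, 25); }
static uint32_t sigma0(uint32_t w) { return ror32(w, 7)  ^ ror32(w, 18) ^ (w >> 3);  }
static uint32_t sigma1(uint32_t w) { return ror32(w, 17) ^ ror32(w, 19) ^ (w >> 10); }
```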


      • Ki: SHA1 constants dependent on immediate i.
        K0 = 0x5A827999
        K1 = 0x6ED9EBA1
        K2 = 0x8F1BBCDC
        K3 = 0xCA62C1D6


    3. INSTRUCTIONS (M-U)

Chapter 4 continues an alphabetical discussion of Intel® 64 and IA-32 instructions (M-U). See also: Chapter 3, “Instruction Set Reference, A-L,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, and Chapter 5, “Instruction Set Reference, V-Z,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2C.


MASKMOVDQU—Store Selected Bytes of Double Quadword

Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
66 0F F7 /r MASKMOVDQU xmm1, xmm2 | RM | V/V | SSE2 | Selectively write bytes from xmm1 to memory location using the byte mask in xmm2. The default memory location is specified by DS:DI/EDI/RDI.
VEX.128.66.0F.WIG F7 /r VMASKMOVDQU xmm1, xmm2 | RM | V/V | AVX | Selectively write bytes from xmm1 to memory location using the byte mask in xmm2. The default memory location is specified by DS:DI/EDI/RDI.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (r) | ModRM:r/m (r) | NA | NA


Description

Stores selected bytes from the source operand (first operand) into a 128-bit memory location. The mask operand (second operand) selects which bytes from the source operand are written to memory. The source and mask operands are XMM registers. The memory location is specified by the effective address in the DI/EDI/RDI register (the default segment register is DS, but this may be overridden with a segment-override prefix). The memory location does not need to be aligned on a natural boundary. (The size of the store address depends on the address-size attribute.)

The most significant bit in each byte of the mask operand determines whether the corresponding byte in the source operand is written to the corresponding byte location in memory: 0 indicates no write and 1 indicates write.

The MASKMOVDQU instruction generates a non-temporal hint to the processor to minimize cache pollution. The non-temporal hint is implemented by using a write combining (WC) memory type protocol (see “Caching of Temporal vs. Non-Temporal Data” in Chapter 10, of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1). Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MASKMOVDQU instructions if multiple processors might use different memory types to read/write the destination memory locations.

Behavior with a mask of all 0s is as follows:

• No data will be written to memory.

• Signaling of breakpoints (code or data) is not guaranteed; different processor implementations may signal or not signal these breakpoints.

• Exceptions associated with addressing memory and page faults may still be signaled (implementation dependent).

• If the destination memory region is mapped as UC or WP, enforcement of associated semantics for these memory types is not guaranteed (that is, is reserved) and is implementation-specific.

The MASKMOVDQU instruction can be used to improve performance for algorithms that need to merge data on a byte-by-byte basis. It should not cause a read for ownership; doing so generates unnecessary bandwidth since data is to be written directly using the byte-mask without allocating old data prior to the store.

In 64-bit mode, the memory address is specified by DS:RDI.
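The store semantics can be modeled with a scalar loop; this sketch ignores the non-temporal hint and segment handling, and the function name is illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the MASKMOVDQU store: byte i of src is written to
   p[i] only when the most significant bit of mask byte i is set. */
static void maskmovdqu_model(const uint8_t src[16], const uint8_t mask[16],
                             uint8_t *p) {
    for (int i = 0; i < 16; i++)
        if (mask[i] & 0x80)
            p[i] = src[i];
}
```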



Operation

IF (MASK[7] = 1)
    THEN DEST[DI/EDI] ← SRC[7:0] ELSE (* Memory location unchanged *); FI;
IF (MASK[15] = 1)
    THEN DEST[DI/EDI +1] ← SRC[15:8] ELSE (* Memory location unchanged *); FI;
(* Repeat operation for 3rd through 15th bytes in source operand *)
IF (MASK[127] = 1)
    THEN DEST[DI/EDI +15] ← SRC[127:120] ELSE (* Memory location unchanged *); FI;


Intel C/C Compiler Intrinsic Equivalent

void _mm_maskmove_si64( m64d, m64n, char * p)


Other Exceptions

See Table 22-8, “Exception Conditions for Legacy SIMD/MMX Instructions without FP Exception,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.


MAXPD—Maximum of Packed Double-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 5F /r MAXPD xmm1, xmm2/m128 | A | V/V | SSE2 | Return the maximum double-precision floating-point values between xmm1 and xmm2/m128.
VEX.NDS.128.66.0F.WIG 5F /r VMAXPD xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Return the maximum double-precision floating-point values between xmm2 and xmm3/m128.
VEX.NDS.256.66.0F.WIG 5F /r VMAXPD ymm1, ymm2, ymm3/m256 | B | V/V | AVX | Return the maximum packed double-precision floating-point values between ymm2 and ymm3/m256.
EVEX.NDS.128.66.0F.W1 5F /r VMAXPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | C | V/V | AVX512VL AVX512F | Return the maximum packed double-precision floating-point values between xmm2 and xmm3/m128/m64bcst and store result in xmm1 subject to writemask k1.
EVEX.NDS.256.66.0F.W1 5F /r VMAXPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | C | V/V | AVX512VL AVX512F | Return the maximum packed double-precision floating-point values between ymm2 and ymm3/m256/m64bcst and store result in ymm1 subject to writemask k1.
EVEX.NDS.512.66.0F.W1 5F /r VMAXPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{sae} | C | V/V | AVX512F | Return the maximum packed double-precision floating-point values between zmm2 and zmm3/m512/m64bcst and store result in zmm1 subject to writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA
C | Full | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA


Description

Performs a SIMD compare of the packed double-precision floating-point values in the first source operand and the second source operand and returns the maximum value for each pair of values to the destination operand.

If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source operand (from either the first or second operand) be returned, the action of MAXPD can be emulated using a sequence of instructions, such as a comparison followed by AND, ANDN and OR.
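The per-element behavior, including the asymmetric zero and NaN handling described above, can be modeled with a scalar C function. This is an illustrative sketch (the name is not Intel's), and note how it differs from a plain fmax(), which would suppress NaNs:

```c
#include <assert.h>
#include <math.h>

/* Scalar model of the per-element MAX() used by MAXPD: if both inputs
   are zero (of either sign) or either input is a NaN, the second source
   is returned; otherwise the numerically greater value is returned. */
static double maxpd_element(double src1, double src2) {
    if (src1 == 0.0 && src2 == 0.0) return src2;  /* covers +0.0 vs -0.0 */
    if (isnan(src1) || isnan(src2)) return src2;  /* NaN: forward src2   */
    return src1 > src2 ? src1 : src2;
}
```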

EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are zeroed.

VEX.128 encoded version: The first source operand is a XMM register. The second source operand can be a XMM register or a 128-bit memory location. The destination operand is a XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.



128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.


Operation

MAX(SRC1, SRC2)

{

IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
ELSE IF (SRC1 > SRC2) THEN DEST ← SRC1;
ELSE DEST ← SRC2;
FI;
}


VMAXPD (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN
            DEST[i+63:i] ← MAX(SRC1[i+63:i], SRC2[63:0])
        ELSE
            DEST[i+63:i] ← MAX(SRC1[i+63:i], SRC2[i+63:i])
        FI;
    ELSE
        IF *merging-masking*   ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
        ELSE                   ; zeroing-masking
            DEST[i+63:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VMAXPD (VEX.256 encoded version)

DEST[63:0] ← MAX(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← MAX(SRC1[127:64], SRC2[127:64])
DEST[191:128] ← MAX(SRC1[191:128], SRC2[191:128])
DEST[255:192] ← MAX(SRC1[255:192], SRC2[255:192])
DEST[MAXVL-1:256] ← 0


VMAXPD (VEX.128 encoded version)

DEST[63:0] ← MAX(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← MAX(SRC1[127:64], SRC2[127:64])
DEST[MAXVL-1:128] ← 0



MAXPD (128-bit Legacy SSE version)

DEST[63:0] ← MAX(DEST[63:0], SRC[63:0])
DEST[127:64] ← MAX(DEST[127:64], SRC[127:64])

DEST[MAXVL-1:128] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VMAXPD __m512d _mm512_max_pd( __m512d a, __m512d b);

VMAXPD __m512d _mm512_mask_max_pd(__m512d s, __mmask8 k, __m512d a, __m512d b);
VMAXPD __m512d _mm512_maskz_max_pd( __mmask8 k, __m512d a, __m512d b);

VMAXPD __m512d _mm512_max_round_pd( __m512d a, __m512d b, int);

VMAXPD __m512d _mm512_mask_max_round_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int);
VMAXPD __m512d _mm512_maskz_max_round_pd( __mmask8 k, __m512d a, __m512d b, int);

VMAXPD __m256d _mm256_mask_max_pd(__m256d s, __mmask8 k, __m256d a, __m256d b);
VMAXPD __m256d _mm256_maskz_max_pd( __mmask8 k, __m256d a, __m256d b);

VMAXPD __m128d _mm_mask_max_pd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VMAXPD __m128d _mm_maskz_max_pd( __mmask8 k, __m128d a, __m128d b);

VMAXPD __m256d _mm256_max_pd ( __m256d a, __m256d b);
(V)MAXPD __m128d _mm_max_pd ( __m128d a, __m128d b);


SIMD Floating-Point Exceptions

Invalid (including QNaN Source Operand), Denormal


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 2. EVEX-encoded instruction, see Exceptions Type E2.


MAXPS—Maximum of Packed Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 5F /r MAXPS xmm1, xmm2/m128 | A | V/V | SSE | Return the maximum single-precision floating-point values between xmm1 and xmm2/mem.
VEX.NDS.128.0F.WIG 5F /r VMAXPS xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Return the maximum single-precision floating-point values between xmm2 and xmm3/mem.
VEX.NDS.256.0F.WIG 5F /r VMAXPS ymm1, ymm2, ymm3/m256 | B | V/V | AVX | Return the maximum single-precision floating-point values between ymm2 and ymm3/mem.
EVEX.NDS.128.0F.W0 5F /r VMAXPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | C | V/V | AVX512VL AVX512F | Return the maximum packed single-precision floating-point values between xmm2 and xmm3/m128/m32bcst and store result in xmm1 subject to writemask k1.
EVEX.NDS.256.0F.W0 5F /r VMAXPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | C | V/V | AVX512VL AVX512F | Return the maximum packed single-precision floating-point values between ymm2 and ymm3/m256/m32bcst and store result in ymm1 subject to writemask k1.
EVEX.NDS.512.0F.W0 5F /r VMAXPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{sae} | C | V/V | AVX512F | Return the maximum packed single-precision floating-point values between zmm2 and zmm3/m512/m32bcst and store result in zmm1 subject to writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA
C | Full | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA


Description

Performs a SIMD compare of the packed single-precision floating-point values in the first source operand and the second source operand and returns the maximum value for each pair of values to the destination operand.

If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source operand (from either the first or second operand) be returned, the action of MAXPS can be emulated using a sequence of instructions, such as a comparison followed by AND, ANDN and OR.

EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are zeroed.

VEX.128 encoded version: The first source operand is a XMM register. The second source operand can be a XMM register or a 128-bit memory location. The destination operand is a XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.



Operation

MAX(SRC1, SRC2)

{

IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
ELSE IF (SRC1 > SRC2) THEN DEST ← SRC1;
ELSE DEST ← SRC2;
FI;
}


VMAXPS (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN
            DEST[i+31:i] ← MAX(SRC1[i+31:i], SRC2[31:0])
        ELSE
            DEST[i+31:i] ← MAX(SRC1[i+31:i], SRC2[i+31:i])
        FI;
    ELSE
        IF *merging-masking*   ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
        ELSE                   ; zeroing-masking
            DEST[i+31:i] ← 0
        FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VMAXPS (VEX.256 encoded version)

DEST[31:0] ← MAX(SRC1[31:0], SRC2[31:0])
DEST[63:32] ← MAX(SRC1[63:32], SRC2[63:32])
DEST[95:64] ← MAX(SRC1[95:64], SRC2[95:64])
DEST[127:96] ← MAX(SRC1[127:96], SRC2[127:96])
DEST[159:128] ← MAX(SRC1[159:128], SRC2[159:128])
DEST[191:160] ← MAX(SRC1[191:160], SRC2[191:160])
DEST[223:192] ← MAX(SRC1[223:192], SRC2[223:192])
DEST[255:224] ← MAX(SRC1[255:224], SRC2[255:224])
DEST[MAXVL-1:256] ← 0


VMAXPS (VEX.128 encoded version)

DEST[31:0] ← MAX(SRC1[31:0], SRC2[31:0])
DEST[63:32] ← MAX(SRC1[63:32], SRC2[63:32])
DEST[95:64] ← MAX(SRC1[95:64], SRC2[95:64])
DEST[127:96] ← MAX(SRC1[127:96], SRC2[127:96])
DEST[MAXVL-1:128] ← 0



MAXPS (128-bit Legacy SSE version)

DEST[31:0] ← MAX(DEST[31:0], SRC[31:0])
DEST[63:32] ← MAX(DEST[63:32], SRC[63:32])
DEST[95:64] ← MAX(DEST[95:64], SRC[95:64])
DEST[127:96] ← MAX(DEST[127:96], SRC[127:96])
DEST[MAXVL-1:128] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VMAXPS __m512 _mm512_max_ps( __m512 a, __m512 b);

VMAXPS __m512 _mm512_mask_max_ps(__m512 s, __mmask16 k, __m512 a, __m512 b);
VMAXPS __m512 _mm512_maskz_max_ps( __mmask16 k, __m512 a, __m512 b);
VMAXPS __m512 _mm512_max_round_ps( __m512 a, __m512 b, int);

VMAXPS __m512 _mm512_mask_max_round_ps(__m512 s, __mmask16 k, __m512 a, __m512 b, int);
VMAXPS __m512 _mm512_maskz_max_round_ps( __mmask16 k, __m512 a, __m512 b, int);

VMAXPS __m256 _mm256_mask_max_ps(__m256 s, __mmask8 k, __m256 a, __m256 b);
VMAXPS __m256 _mm256_maskz_max_ps( __mmask8 k, __m256 a, __m256 b);

VMAXPS __m128 _mm_mask_max_ps(__m128 s, __mmask8 k, __m128 a, __m128 b);
VMAXPS __m128 _mm_maskz_max_ps( __mmask8 k, __m128 a, __m128 b);
VMAXPS __m256 _mm256_max_ps ( __m256 a, __m256 b);

MAXPS __m128 _mm_max_ps ( __m128 a, __m128 b);


SIMD Floating-Point Exceptions

Invalid (including QNaN Source Operand), Denormal


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 2. EVEX-encoded instruction, see Exceptions Type E2.


MAXSD—Return Maximum Scalar Double-Precision Floating-Point Value

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F2 0F 5F /r MAXSD xmm1, xmm2/m64 | A | V/V | SSE2 | Return the maximum scalar double-precision floating-point value between xmm2/m64 and xmm1.
VEX.NDS.LIG.F2.0F.WIG 5F /r VMAXSD xmm1, xmm2, xmm3/m64 | B | V/V | AVX | Return the maximum scalar double-precision floating-point value between xmm3/m64 and xmm2.
EVEX.NDS.LIG.F2.0F.W1 5F /r VMAXSD xmm1 {k1}{z}, xmm2, xmm3/m64{sae} | C | V/V | AVX512F | Return the maximum scalar double-precision floating-point value between xmm3/m64 and xmm2.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA
C | Tuple1 Scalar | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA


Description

Compares the low double-precision floating-point values in the first source operand and the second source operand, and returns the maximum value to the low quadword of the destination operand. The second source operand can be an XMM register or a 64-bit memory location. The first source and destination operands are XMM registers. When the second source operand is a memory operand, only 64 bits are accessed.

If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If a value in the second source operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN of either source operand be returned, the action of MAXSD can be emulated using a sequence of instructions, such as, a comparison followed by AND, ANDN and OR.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:64) of the corresponding destination register remain unchanged.

VEX.128 and EVEX encoded version: Bits (127:64) of the XMM register destination are copied from corresponding bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.

EVEX encoded version: The low quadword element of the destination operand is updated according to the writemask.

Software should ensure VMAXSD is encoded with VEX.L=0. Encoding VMAXSD with VEX.L=1 may encounter unpredictable behavior across different processor generations.



Operation

MAX(SRC1, SRC2)

{

IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC1 > SRC2) THEN DEST ← SRC1;
    ELSE DEST ← SRC2;

FI;

}


VMAXSD (EVEX encoded version)
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← MAX(SRC1[63:0], SRC2[63:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0


VMAXSD (VEX.128 encoded version)
DEST[63:0] ← MAX(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0


MAXSD (128-bit Legacy SSE version)

DEST[63:0] ← MAX(DEST[63:0], SRC[63:0])

DEST[MAXVL-1:64] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VMAXSD __m128d _mm_max_round_sd( __m128d a, __m128d b, int);
VMAXSD __m128d _mm_mask_max_round_sd( __m128d s, __mmask8 k, __m128d a, __m128d b, int);
VMAXSD __m128d _mm_maskz_max_round_sd( __mmask8 k, __m128d a, __m128d b, int);
MAXSD __m128d _mm_max_sd( __m128d a, __m128d b)


SIMD Floating-Point Exceptions

Invalid (Including QNaN Source Operand), Denormal


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 3. EVEX-encoded instruction, see Exceptions Type E3.


MAXSS—Return Maximum Scalar Single-Precision Floating-Point Value

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 5F /r MAXSS xmm1, xmm2/m32 | A | V/V | SSE | Return the maximum scalar single-precision floating-point value between xmm2/m32 and xmm1.
VEX.NDS.LIG.F3.0F.WIG 5F /r VMAXSS xmm1, xmm2, xmm3/m32 | B | V/V | AVX | Return the maximum scalar single-precision floating-point value between xmm3/m32 and xmm2.
EVEX.NDS.LIG.F3.0F.W0 5F /r VMAXSS xmm1 {k1}{z}, xmm2, xmm3/m32{sae} | C | V/V | AVX512F | Return the maximum scalar single-precision floating-point value between xmm3/m32 and xmm2.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA
C | Tuple1 Scalar | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA


Description

Compares the low single-precision floating-point values in the first source operand and the second source operand, and returns the maximum value to the low doubleword of the destination operand.

If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If a value in the second source operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN from either source operand be returned, the action of MAXSS can be emulated using a sequence of instructions, such as, a comparison followed by AND, ANDN and OR.

The second source operand can be an XMM register or a 32-bit memory location. The first source and destination operands are XMM registers.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:32) of the corresponding destination register remain unchanged.

VEX.128 and EVEX encoded version: The first source operand is an xmm register encoded by VEX.vvvv. Bits (127:32) of the XMM register destination are copied from corresponding bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.

EVEX encoded version: The low doubleword element of the destination operand is updated according to the writemask.

Software should ensure VMAXSS is encoded with VEX.L=0. Encoding VMAXSS with VEX.L=1 may encounter unpredictable behavior across different processor generations.



Operation

MAX(SRC1, SRC2)

{

IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC1 > SRC2) THEN DEST ← SRC1;
    ELSE DEST ← SRC2;

FI;

}


VMAXSS (EVEX encoded version)
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← MAX(SRC1[31:0], SRC2[31:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0


VMAXSS (VEX.128 encoded version)
DEST[31:0] ← MAX(SRC1[31:0], SRC2[31:0])
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0


MAXSS (128-bit Legacy SSE version)

DEST[31:0] ← MAX(DEST[31:0], SRC[31:0])

DEST[MAXVL-1:32] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VMAXSS __m128 _mm_max_round_ss( __m128 a, __m128 b, int);
VMAXSS __m128 _mm_mask_max_round_ss( __m128 s, __mmask8 k, __m128 a, __m128 b, int);
VMAXSS __m128 _mm_maskz_max_round_ss( __mmask8 k, __m128 a, __m128 b, int);
MAXSS __m128 _mm_max_ss( __m128 a, __m128 b)


SIMD Floating-Point Exceptions

Invalid (Including QNaN Source Operand), Denormal


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 3. EVEX-encoded instruction, see Exceptions Type E3.



MFENCE—Memory Fence

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
NP 0F AE F0 | MFENCE | ZO | Valid | Valid | Serializes load and store operations.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
ZO | NA | NA | NA | NA


Description

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior to the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction.1 The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction). MFENCE does not serialize the instruction stream.

Weakly ordered memory types can be used to achieve higher processor performance through such techniques as out-of-order issue, speculative reads, write-combining, and write-collapsing. The degree to which a consumer of data recognizes or knows that the data is weakly ordered varies among applications and may be unknown to the producer of this data. The MFENCE instruction provides a performance-efficient way of ensuring load and store ordering between routines that produce weakly-ordered results and routines that consume that data.

Processors are free to fetch and cache data speculatively from regions of system memory that use the WB, WC, and WT memory types. This speculative fetching can occur at any time and is not tied to instruction execution. Thus, it is not ordered with respect to executions of the MFENCE instruction; data can be brought into the caches specula- tively just before, during, or after the execution of an MFENCE instruction.

This instruction’s operation is the same in non-64-bit modes and 64-bit mode.

Specification of the instruction's opcode above indicates a ModR/M byte of F0. For this instruction, the processor ignores the r/m field of the ModR/M byte. Thus, MFENCE is encoded by any opcode of the form 0F AE Fx, where x is in the range 0-7.


Operation

Wait_On_Following_Loads_And_Stores_Until(preceding_loads_and_stores_globally_visible);


Intel C/C Compiler Intrinsic Equivalent

void _mm_mfence(void)


Exceptions (All Modes of Operation)

#UD If CPUID.01H:EDX.SSE2[bit 26] = 0.

If the LOCK prefix is used.



1. A load instruction is considered to become globally visible when the value to be loaded into its destination register is determined.


MINPD—Minimum of Packed Double-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 5D /r MINPD xmm1, xmm2/m128 | A | V/V | SSE2 | Return the minimum double-precision floating-point values between xmm1 and xmm2/mem.
VEX.NDS.128.66.0F.WIG 5D /r VMINPD xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Return the minimum double-precision floating-point values between xmm2 and xmm3/mem.
VEX.NDS.256.66.0F.WIG 5D /r VMINPD ymm1, ymm2, ymm3/m256 | B | V/V | AVX | Return the minimum packed double-precision floating-point values between ymm2 and ymm3/mem.
EVEX.NDS.128.66.0F.W1 5D /r VMINPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | C | V/V | AVX512VL AVX512F | Return the minimum packed double-precision floating-point values between xmm2 and xmm3/m128/m64bcst and store result in xmm1 subject to writemask k1.
EVEX.NDS.256.66.0F.W1 5D /r VMINPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | C | V/V | AVX512VL AVX512F | Return the minimum packed double-precision floating-point values between ymm2 and ymm3/m256/m64bcst and store result in ymm1 subject to writemask k1.
EVEX.NDS.512.66.0F.W1 5D /r VMINPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{sae} | C | V/V | AVX512F | Return the minimum packed double-precision floating-point values between zmm2 and zmm3/m512/m64bcst and store result in zmm1 subject to writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA
C | Full | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA


Description

Performs a SIMD compare of the packed double-precision floating-point values in the first source operand and the second source operand and returns the minimum value for each pair of values to the destination operand.

If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. If a value in the second operand is an SNaN, that SNaN is forwarded unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source operand (from either the first or second operand) be returned, the action of MINPD can be emulated using a sequence of instructions, such as, a comparison followed by AND, ANDN and OR.

EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are zeroed.

VEX.128 encoded version: The first source operand is a XMM register. The second source operand can be a XMM register or a 128-bit memory location. The destination operand is a XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.



Operation

MIN(SRC1, SRC2)

{

IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC1 < SRC2) THEN DEST ← SRC1;
    ELSE DEST ← SRC2;

FI;

}


VMINPD (EVEX encoded version)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN DEST[i+63:i] ← MIN(SRC1[i+63:i], SRC2[63:0])
            ELSE DEST[i+63:i] ← MIN(SRC1[i+63:i], SRC2[i+63:i])
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+63:i] remains unchanged*
            ELSE DEST[i+63:i] ← 0 ; zeroing-masking
        FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VMINPD (VEX.256 encoded version)
DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← MIN(SRC1[127:64], SRC2[127:64])
DEST[191:128] ← MIN(SRC1[191:128], SRC2[191:128])
DEST[255:192] ← MIN(SRC1[255:192], SRC2[255:192])
DEST[MAXVL-1:256] ← 0


VMINPD (VEX.128 encoded version)
DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← MIN(SRC1[127:64], SRC2[127:64])
DEST[MAXVL-1:128] ← 0


MINPD (128-bit Legacy SSE version)

DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← MIN(SRC1[127:64], SRC2[127:64])

DEST[MAXVL-1:128] (Unmodified)



Intel C/C++ Compiler Intrinsic Equivalent

VMINPD __m512d _mm512_min_pd( __m512d a, __m512d b);
VMINPD __m512d _mm512_mask_min_pd( __m512d s, __mmask8 k, __m512d a, __m512d b);
VMINPD __m512d _mm512_maskz_min_pd( __mmask8 k, __m512d a, __m512d b);
VMINPD __m512d _mm512_min_round_pd( __m512d a, __m512d b, int);
VMINPD __m512d _mm512_mask_min_round_pd( __m512d s, __mmask8 k, __m512d a, __m512d b, int);
VMINPD __m512d _mm512_maskz_min_round_pd( __mmask8 k, __m512d a, __m512d b, int);
VMINPD __m256d _mm256_mask_min_pd( __m256d s, __mmask8 k, __m256d a, __m256d b);
VMINPD __m256d _mm256_maskz_min_pd( __mmask8 k, __m256d a, __m256d b);
VMINPD __m128d _mm_mask_min_pd( __m128d s, __mmask8 k, __m128d a, __m128d b);
VMINPD __m128d _mm_maskz_min_pd( __mmask8 k, __m128d a, __m128d b);
VMINPD __m256d _mm256_min_pd ( __m256d a, __m256d b);
MINPD __m128d _mm_min_pd ( __m128d a, __m128d b);


SIMD Floating-Point Exceptions

Invalid (including QNaN Source Operand), Denormal


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 2. EVEX-encoded instruction, see Exceptions Type E2.


MINPS—Minimum of Packed Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 5D /r MINPS xmm1, xmm2/m128 | A | V/V | SSE | Return the minimum single-precision floating-point values between xmm1 and xmm2/mem.
VEX.NDS.128.0F.WIG 5D /r VMINPS xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Return the minimum single-precision floating-point values between xmm2 and xmm3/mem.
VEX.NDS.256.0F.WIG 5D /r VMINPS ymm1, ymm2, ymm3/m256 | B | V/V | AVX | Return the minimum single-precision floating-point values between ymm2 and ymm3/mem.
EVEX.NDS.128.0F.W0 5D /r VMINPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | C | V/V | AVX512VL AVX512F | Return the minimum packed single-precision floating-point values between xmm2 and xmm3/m128/m32bcst and store result in xmm1 subject to writemask k1.
EVEX.NDS.256.0F.W0 5D /r VMINPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | C | V/V | AVX512VL AVX512F | Return the minimum packed single-precision floating-point values between ymm2 and ymm3/m256/m32bcst and store result in ymm1 subject to writemask k1.
EVEX.NDS.512.0F.W0 5D /r VMINPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{sae} | C | V/V | AVX512F | Return the minimum packed single-precision floating-point values between zmm2 and zmm3/m512/m32bcst and store result in zmm1 subject to writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA
C | Full | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA


Description

Performs a SIMD compare of the packed single-precision floating-point values in the first source operand and the second source operand and returns the minimum value for each pair of values to the destination operand.

If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. If a value in the second operand is an SNaN, that SNaN is forwarded unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source operand (from either the first or second operand) be returned, the action of MINPS can be emulated using a sequence of instructions, such as, a comparison followed by AND, ANDN and OR.

EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are zeroed.

VEX.128 encoded version: The first source operand is a XMM register. The second source operand can be a XMM register or a 128-bit memory location. The destination operand is a XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.



Operation

MIN(SRC1, SRC2)

{

IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC1 < SRC2) THEN DEST ← SRC1;
    ELSE DEST ← SRC2;

FI;

}


VMINPS (EVEX encoded version)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask* THEN
        IF (EVEX.b = 1) AND (SRC2 *is memory*)
            THEN DEST[i+31:i] ← MIN(SRC1[i+31:i], SRC2[31:0])
            ELSE DEST[i+31:i] ← MIN(SRC1[i+31:i], SRC2[i+31:i])
        FI;
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[i+31:i] remains unchanged*
            ELSE DEST[i+31:i] ← 0 ; zeroing-masking
        FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VMINPS (VEX.256 encoded version)
DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])
DEST[63:32] ← MIN(SRC1[63:32], SRC2[63:32])
DEST[95:64] ← MIN(SRC1[95:64], SRC2[95:64])
DEST[127:96] ← MIN(SRC1[127:96], SRC2[127:96])
DEST[159:128] ← MIN(SRC1[159:128], SRC2[159:128])
DEST[191:160] ← MIN(SRC1[191:160], SRC2[191:160])
DEST[223:192] ← MIN(SRC1[223:192], SRC2[223:192])
DEST[255:224] ← MIN(SRC1[255:224], SRC2[255:224])
DEST[MAXVL-1:256] ← 0


VMINPS (VEX.128 encoded version)
DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])
DEST[63:32] ← MIN(SRC1[63:32], SRC2[63:32])
DEST[95:64] ← MIN(SRC1[95:64], SRC2[95:64])
DEST[127:96] ← MIN(SRC1[127:96], SRC2[127:96])
DEST[MAXVL-1:128] ← 0



MINPS (128-bit Legacy SSE version)
DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])
DEST[63:32] ← MIN(SRC1[63:32], SRC2[63:32])
DEST[95:64] ← MIN(SRC1[95:64], SRC2[95:64])
DEST[127:96] ← MIN(SRC1[127:96], SRC2[127:96])
DEST[MAXVL-1:128] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VMINPS __m512 _mm512_min_ps( __m512 a, __m512 b);
VMINPS __m512 _mm512_mask_min_ps( __m512 s, __mmask16 k, __m512 a, __m512 b);
VMINPS __m512 _mm512_maskz_min_ps( __mmask16 k, __m512 a, __m512 b);
VMINPS __m512 _mm512_min_round_ps( __m512 a, __m512 b, int);
VMINPS __m512 _mm512_mask_min_round_ps( __m512 s, __mmask16 k, __m512 a, __m512 b, int);
VMINPS __m512 _mm512_maskz_min_round_ps( __mmask16 k, __m512 a, __m512 b, int);
VMINPS __m256 _mm256_mask_min_ps( __m256 s, __mmask8 k, __m256 a, __m256 b);
VMINPS __m256 _mm256_maskz_min_ps( __mmask8 k, __m256 a, __m256 b);
VMINPS __m128 _mm_mask_min_ps( __m128 s, __mmask8 k, __m128 a, __m128 b);
VMINPS __m128 _mm_maskz_min_ps( __mmask8 k, __m128 a, __m128 b);
VMINPS __m256 _mm256_min_ps ( __m256 a, __m256 b);
MINPS __m128 _mm_min_ps ( __m128 a, __m128 b);


SIMD Floating-Point Exceptions

Invalid (including QNaN Source Operand), Denormal


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 2. EVEX-encoded instruction, see Exceptions Type E2.


MINSD—Return Minimum Scalar Double-Precision Floating-Point Value

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F2 0F 5D /r MINSD xmm1, xmm2/m64 | A | V/V | SSE2 | Return the minimum scalar double-precision floating-point value between xmm2/m64 and xmm1.
VEX.NDS.LIG.F2.0F.WIG 5D /r VMINSD xmm1, xmm2, xmm3/m64 | B | V/V | AVX | Return the minimum scalar double-precision floating-point value between xmm3/m64 and xmm2.
EVEX.NDS.LIG.F2.0F.W1 5D /r VMINSD xmm1 {k1}{z}, xmm2, xmm3/m64{sae} | C | V/V | AVX512F | Return the minimum scalar double-precision floating-point value between xmm3/m64 and xmm2.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA
C | Tuple1 Scalar | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA


Description

Compares the low double-precision floating-point values in the first source operand and the second source operand, and returns the minimum value to the low quadword of the destination operand. When the second source operand is a memory operand, only 64 bits are accessed.

If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If a value in the second source operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source operand (from either the first or second source) be returned, the action of MINSD can be emulated using a sequence of instructions, such as, a comparison followed by AND, ANDN and OR.

The second source operand can be an XMM register or a 64-bit memory location. The first source and destination operands are XMM registers.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:64) of the corresponding destination register remain unchanged.

VEX.128 and EVEX encoded version: Bits (127:64) of the XMM register destination are copied from corresponding bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.

EVEX encoded version: The low quadword element of the destination operand is updated according to the writemask.

Software should ensure VMINSD is encoded with VEX.L=0. Encoding VMINSD with VEX.L=1 may encounter unpredictable behavior across different processor generations.



Operation

MIN(SRC1, SRC2)

{

IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC1 < SRC2) THEN DEST ← SRC1;
    ELSE DEST ← SRC2;

FI;

}


VMINSD (EVEX encoded version)
IF k1[0] or *no writemask*
    THEN DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0


VMINSD (VEX.128 encoded version)
DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0


MINSD (128-bit Legacy SSE version)

DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0])

DEST[MAXVL-1:64] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VMINSD __m128d _mm_min_round_sd( __m128d a, __m128d b, int);
VMINSD __m128d _mm_mask_min_round_sd( __m128d s, __mmask8 k, __m128d a, __m128d b, int);
VMINSD __m128d _mm_maskz_min_round_sd( __mmask8 k, __m128d a, __m128d b, int);
MINSD __m128d _mm_min_sd( __m128d a, __m128d b)


SIMD Floating-Point Exceptions

Invalid (including QNaN Source Operand), Denormal


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 3. EVEX-encoded instruction, see Exceptions Type E3.


MINSS—Return Minimum Scalar Single-Precision Floating-Point Value

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 5D /r MINSS xmm1, xmm2/m32 | A | V/V | SSE | Return the minimum scalar single-precision floating-point value between xmm2/m32 and xmm1.
VEX.NDS.LIG.F3.0F.WIG 5D /r VMINSS xmm1, xmm2, xmm3/m32 | B | V/V | AVX | Return the minimum scalar single-precision floating-point value between xmm3/m32 and xmm2.
EVEX.NDS.LIG.F3.0F.W0 5D /r VMINSS xmm1 {k1}{z}, xmm2, xmm3/m32{sae} | C | V/V | AVX512F | Return the minimum scalar single-precision floating-point value between xmm3/m32 and xmm2.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv | ModRM:r/m (r) | NA
C | Tuple1 Scalar | ModRM:reg (w) | EVEX.vvvv | ModRM:r/m (r) | NA


Description

Compares the low single-precision floating-point values in the first source operand and the second source operand and returns the minimum value to the low doubleword of the destination operand.

If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If a value in the second operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN in either source operand be returned, the action of MINSS can be emulated using a sequence of instructions, such as, a comparison followed by AND, ANDN and OR.

The second source operand can be an XMM register or a 32-bit memory location. The first source and destination operands are XMM registers.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:32) of the corresponding destination register remain unchanged.

VEX.128 and EVEX encoded version: The first source operand is an xmm register encoded by (E)VEX.vvvv. Bits (127:32) of the XMM register destination are copied from corresponding bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.

EVEX encoded version: The low doubleword element of the destination operand is updated according to the writemask.

Software should ensure VMINSS is encoded with VEX.L=0. Encoding VMINSS with VEX.L=1 may encounter unpredictable behavior across different processor generations.



Operation

MIN(SRC1, SRC2)

{

IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC1 < SRC2) THEN DEST ← SRC1;
    ELSE DEST ← SRC2;

FI;

}


VMINSS (EVEX encoded version)
IF k1[0] or *no writemask*
    THEN DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])
    ELSE
        IF *merging-masking* ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0


VMINSS (VEX.128 encoded version)
DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0


MINSS (128-bit Legacy SSE version)
DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])
DEST[MAXVL-1:32] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VMINSS __m128 _mm_min_round_ss( __m128 a, __m128 b, int);
VMINSS __m128 _mm_mask_min_round_ss( __m128 s, __mmask8 k, __m128 a, __m128 b, int);
VMINSS __m128 _mm_maskz_min_round_ss( __mmask8 k, __m128 a, __m128 b, int);
MINSS __m128 _mm_min_ss( __m128 a, __m128 b)


SIMD Floating-Point Exceptions

Invalid (Including QNaN Source Operand), Denormal


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 3. EVEX-encoded instruction, see Exceptions Type E3.


MONITOR—Set Up Monitor Address

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0F 01 C8 | MONITOR | ZO | Valid | Valid | Sets up a linear address range to be monitored by hardware and activates the monitor. The address range should be a write-back memory caching type. The address is DS:RAX/EAX/AX.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
ZO | NA | NA | NA | NA


Description

The MONITOR instruction arms address monitoring hardware using an address specified in EAX (the address range that the monitoring hardware checks for store operations can be determined by using CPUID). A store to an address within the specified address range triggers the monitoring hardware. The state of monitor hardware is used by MWAIT.

The address is specified in RAX/EAX/AX and the size is based on the effective address size of the encoded instruction. By default, the DS segment is used to create a linear address that is monitored. Segment overrides can be used.

ECX and EDX are also used. They communicate other information to MONITOR. ECX specifies optional extensions. EDX specifies optional hints; it does not change the architectural behavior of the instruction. For the Pentium 4 processor (family 15, model 3), no extensions or hints are defined. Undefined hints in EDX are ignored by the processor; undefined extensions in ECX raise a general-protection fault.

The address range must use memory of the write-back type. Only write-back memory will correctly trigger the monitoring hardware. Additional information on determining what address range to use in order to prevent false wake-ups is described in Chapter 8, “Multiple-Processor Management” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

The MONITOR instruction is ordered as a load operation with respect to other memory transactions. The instruction is subject to the permission checking and faults associated with a byte load. Like a load, MONITOR sets the A-bit but not the D-bit in page tables.

CPUID.01H:ECX.MONITOR[bit 3] indicates the availability of MONITOR and MWAIT in the processor. When set, MONITOR may be executed only at privilege level 0 (use at any other privilege level results in an invalid-opcode exception). The operating system or system BIOS may disable this instruction by using the IA32_MISC_ENABLE MSR; disabling MONITOR clears the CPUID feature flag and causes execution to generate an invalid-opcode exception.

The instruction’s operation is the same in non-64-bit modes and 64-bit mode.


Operation

MONITOR sets up an address range for the monitor hardware using the content of EAX (RAX in 64-bit mode) as an effective address and puts the monitor hardware in armed state. Always use memory of the write-back caching type. A store to the specified address range will trigger the monitor hardware. The content of ECX and EDX are used to communicate other information to the monitor hardware.


Intel C/C Compiler Intrinsic Equivalent

MONITOR: void _mm_monitor(void const *p, unsigned extensions, unsigned hints)
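Because MONITOR itself is usable only at privilege level 0, a user-mode program can at most test the CPUID feature flag described above before relying on MONITOR/MWAIT. The following sketch does exactly that; it assumes the GCC/Clang `<cpuid.h>` helper header on an x86 target (the function name `has_monitor_mwait` is illustrative, not part of any API):

```c
#include <cpuid.h>   /* GCC/Clang helper for executing CPUID */

/* Returns 1 if CPUID.01H:ECX.MONITOR[bit 3] reports MONITOR/MWAIT
 * support, 0 otherwise. Checking the flag is all a CPL-3 program can
 * do; executing MONITOR outside ring 0 raises #UD. */
int has_monitor_mwait(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                 /* CPUID leaf 1 not supported */
    return (ecx >> 3) & 1;        /* bit 3 = MONITOR feature flag */
}
```

Note that the flag may read 0 even on capable hardware if the OS or BIOS has cleared it through IA32_MISC_ENABLE, as described above.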


Numeric Exceptions

None





Protected Mode Exceptions

#GP(0) If the value in EAX is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register is used to access memory and it contains a NULL segment selector.

If ECX ≠ 0.

#SS(0) If the value in EAX is outside the SS segment limit.

#PF(fault-code) For a page fault.

#UD If CPUID.01H:ECX.MONITOR[bit 3] = 0.

If current privilege level is not 0.


Real Address Mode Exceptions

#GP If the CS, DS, ES, FS, or GS register is used to access memory and the value in EAX is outside of the effective address space from 0 to FFFFH.

If ECX ≠ 0.

#SS If the SS register is used to access memory and the value in EAX is outside of the effective address space from 0 to FFFFH.

#UD If CPUID.01H:ECX.MONITOR[bit 3] = 0.


Virtual 8086 Mode Exceptions

#UD The MONITOR instruction is not recognized in virtual-8086 mode (even if CPUID.01H:ECX.MONITOR[bit 3] = 1).


Compatibility Mode Exceptions

Same exceptions as in protected mode.


64-Bit Mode Exceptions

#GP(0) If the linear address of the operand in the CS, DS, ES, FS, or GS segment is in a non-canonical form.

If RCX ≠ 0.

#SS(0) If the SS register is used to access memory and the value in EAX is in a non-canonical form.

#PF(fault-code) For a page fault.

#UD If the current privilege level is not 0.

If CPUID.01H:ECX.MONITOR[bit 3] = 0.


MOV—Move


Opcode

Instruction

Op/ En

64-Bit Mode

Compat/ Leg Mode

Description

88 /r

MOV r/m8,r8

MR

Valid

Valid

Move r8 to r/m8.

REX + 88 /r

MOV r/m8***,r8***

MR

Valid

N.E.

Move r8 to r/m8.

89 /r

MOV r/m16,r16

MR

Valid

Valid

Move r16 to r/m16.

89 /r

MOV r/m32,r32

MR

Valid

Valid

Move r32 to r/m32.

REX.W + 89 /r

MOV r/m64,r64

MR

Valid

N.E.

Move r64 to r/m64.

8A /r

MOV r8,r/m8

RM

Valid

Valid

Move r/m8 to r8.

REX + 8A /r

MOV r8***,r/m8***

RM

Valid

N.E.

Move r/m8 to r8.

8B /r

MOV r16,r/m16

RM

Valid

Valid

Move r/m16 to r16.

8B /r

MOV r32,r/m32

RM

Valid

Valid

Move r/m32 to r32.

REX.W + 8B /r

MOV r64,r/m64

RM

Valid

N.E.

Move r/m64 to r64.

8C /r

MOV r/m16,Sreg**

MR

Valid

Valid

Move segment register to r/m16.

8C /r

MOV r16/r32/m16, Sreg**

MR

Valid

Valid

Move zero extended 16-bit segment register to r16/r32/m16.

REX.W + 8C /r

MOV r64/m16, Sreg**

MR

Valid

Valid

Move zero extended 16-bit segment register to r64/m16.

8E /r

MOV Sreg,r/m16**

RM

Valid

Valid

Move r/m16 to segment register.

REX.W + 8E /r

MOV Sreg,r/m64**

RM

Valid

Valid

Move lower 16 bits of r/m64 to segment register.

A0

MOV AL,moffs8*

FD

Valid

Valid

Move byte at (seg:offset) to AL.

REX.W + A0

MOV AL,moffs8*

FD

Valid

N.E.

Move byte at (offset) to AL.

A1

MOV AX,moffs16*

FD

Valid

Valid

Move word at (seg:offset) to AX.

A1

MOV EAX,moffs32*

FD

Valid

Valid

Move doubleword at (seg:offset) to EAX.

REX.W + A1

MOV RAX,moffs64*

FD

Valid

N.E.

Move quadword at (offset) to RAX.

A2

MOV moffs8,AL

TD

Valid

Valid

Move AL to (seg:offset).

REX.W + A2

MOV moffs8***,AL

TD

Valid

N.E.

Move AL to (offset).

A3

MOV moffs16*,AX

TD

Valid

Valid

Move AX to (seg:offset).

A3

MOV moffs32*,EAX

TD

Valid

Valid

Move EAX to (seg:offset).

REX.W + A3

MOV moffs64*,RAX

TD

Valid

N.E.

Move RAX to (offset).

B0+ rb ib

MOV r8, imm8

OI

Valid

Valid

Move imm8 to r8.

REX + B0+ rb ib

MOV r8***, imm8

OI

Valid

N.E.

Move imm8 to r8.

B8+ rw iw

MOV r16, imm16

OI

Valid

Valid

Move imm16 to r16.

B8+ rd id

MOV r32, imm32

OI

Valid

Valid

Move imm32 to r32.

REX.W + B8+ rd io

MOV r64, imm64

OI

Valid

N.E.

Move imm64 to r64.

C6 /0 ib

MOV r/m8, imm8

MI

Valid

Valid

Move imm8 to r/m8.

REX + C6 /0 ib

MOV r/m8***, imm8

MI

Valid

N.E.

Move imm8 to r/m8.

C7 /0 iw

MOV r/m16, imm16

MI

Valid

Valid

Move imm16 to r/m16.

C7 /0 id

MOV r/m32, imm32

MI

Valid

Valid

Move imm32 to r/m32.

REX.W + C7 /0 id

MOV r/m64, imm32

MI

Valid

N.E.

Move imm32 sign extended to 64-bits to r/m64.


NOTES:

* The moffs8, moffs16, moffs32 and moffs64 operands specify a simple offset relative to the segment base, where 8, 16, 32 and 64 refer to the size of the data. The address-size attribute of the instruction determines the size of the offset, either 16, 32 or 64 bits.

** In 32-bit mode, the assembler may insert the 16-bit operand-size prefix with this instruction (see the following “Description” section for further information).

*** In 64-bit mode, r/m8 cannot be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.



Instruction Operand Encoding

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

MR

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

RM

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

FD

AL/AX/EAX/RAX

Moffs

NA

NA

TD

Moffs (w)

AL/AX/EAX/RAX

NA

NA

OI

opcode + rd (w)

imm8/16/32/64

NA

NA

MI

ModRM:r/m (w)

imm8/16/32/64

NA

NA


Description

Copies the second operand (source operand) to the first operand (destination operand). The source operand can be an immediate value, general-purpose register, segment register, or memory location; the destination operand can be a general-purpose register, segment register, or memory location. Both operands must be the same size, which can be a byte, a word, a doubleword, or a quadword.

The MOV instruction cannot be used to load the CS register. Attempting to do so results in an invalid-opcode exception (#UD). To load the CS register, use the far JMP, CALL, or RET instruction.

If the destination operand is a segment register (DS, ES, FS, GS, or SS), the source operand must be a valid segment selector. In protected mode, moving a segment selector into a segment register automatically causes the segment descriptor information associated with that segment selector to be loaded into the hidden (shadow) part of the segment register. While loading this information, the segment selector and segment descriptor information is validated (see the “Operation” algorithm below). The segment descriptor data is obtained from the GDT or LDT entry for the specified segment selector.

A NULL segment selector (values 0000-0003) can be loaded into the DS, ES, FS, and GS registers without causing a protection exception. However, any subsequent attempt to reference a segment whose corresponding segment register is loaded with a NULL value causes a general protection exception (#GP) and no memory reference occurs.

Loading the SS register with a MOV instruction suppresses or inhibits some debug exceptions and inhibits interrupts on the following instruction boundary. (The inhibition ends after delivery of an exception or the execution of the next instruction.) This behavior allows a stack pointer to be loaded into the ESP register with the next instruction (MOV ESP, stack-pointer value) before an event can be delivered. See Section 6.8.3, “Masking Exceptions and Interrupts When Switching Stacks,” in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A. Intel recommends that software use the LSS instruction to load the SS register and ESP together.

When executing MOV Reg, Sreg, the processor copies the content of Sreg to the 16 least significant bits of the general-purpose register. The upper bits of the destination register are zero for most IA-32 processors (Pentium Pro processors and later) and all Intel 64 processors, with the exception that bits 31:16 are undefined for Intel Quark X1000 processors, Pentium and earlier processors.

In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.R prefix permits access to additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.
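The 32-bit default operand size in 64-bit mode has a visible side effect: a MOV that writes a 32-bit register zero-extends the result into the full 64-bit register. A minimal sketch, assuming GCC/Clang extended inline assembly on an x86-64 target (the function name is illustrative):

```c
#include <stdint.h>

/* Shows that a 32-bit MOV in 64-bit mode zero-extends: writing the
 * EAX-sized alias of a register clears bits 63:32 of the full
 * 64-bit register. */
uint64_t mov32_zero_extends(void)
{
    uint64_t x = 0xDEADBEEFCAFEF00DULL;    /* upper 32 bits nonzero */
    /* The %k0 modifier selects the 32-bit name of the register
     * holding x, so this emits e.g. "movl $0x12345678, %eax". */
    __asm__ volatile ("movl $0x12345678, %k0" : "+r"(x));
    return x;                               /* upper 32 bits now 0 */
}
```

After the 32-bit MOV, the function returns 0x0000000012345678, not 0xDEADBEEF12345678; only a REX.W-promoted (64-bit) MOV preserves the full register semantics described above.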



Operation

DEST ← SRC;

Loading a segment register while in protected mode results in special checks and actions, as described in the following listing. These checks are performed on the segment selector and the segment descriptor to which it points.

IF SS is loaded
    THEN
        IF segment selector is NULL
            THEN #GP(0); FI;
        IF segment selector index is outside descriptor table limits
            OR segment selector's RPL ≠ CPL
            OR segment is not a writable data segment
            OR DPL ≠ CPL
            THEN #GP(selector); FI;
        IF segment not marked present
            THEN #SS(selector);
            ELSE
                SS ← segment selector;
                SS ← segment descriptor; FI;
FI;

IF DS, ES, FS, or GS is loaded with non-NULL selector
    THEN
        IF segment selector index is outside descriptor table limits
            OR segment is not a data or readable code segment
            OR ((segment is a data or nonconforming code segment)
            AND ((RPL > DPL) or (CPL > DPL)))
            THEN #GP(selector); FI;
        IF segment not marked present
            THEN #NP(selector);
            ELSE
                SegmentRegister ← segment selector;
                SegmentRegister ← segment descriptor; FI;
FI;

IF DS, ES, FS, or GS is loaded with NULL selector
    THEN
        SegmentRegister ← segment selector;
        SegmentRegister ← segment descriptor;
FI;


Flags Affected

None



Protected Mode Exceptions

#GP(0) If attempt is made to load SS register with NULL segment selector.

If the destination operand is in a non-writable segment.

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register contains a NULL segment selector.

#GP(selector) If segment selector index is outside descriptor table limits.

If the SS register is being loaded and the segment selector's RPL and the segment descriptor’s DPL are not equal to the CPL.

If the SS register is being loaded and the segment pointed to is a non-writable data segment.

If the DS, ES, FS, or GS register is being loaded and the segment pointed to is not a data or readable code segment.

If the DS, ES, FS, or GS register is being loaded and the segment pointed to is a data or nonconforming code segment, and either the RPL or the CPL is greater than the DPL.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#SS(selector) If the SS register is being loaded and the segment pointed to is marked not present.

#NP If the DS, ES, FS, or GS register is being loaded and the segment pointed to is marked not present.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If attempt is made to load the CS register.

If the LOCK prefix is used.


Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If attempt is made to load the CS register.

If the LOCK prefix is used.


Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If attempt is made to load the CS register.

If the LOCK prefix is used.


Compatibility Mode Exceptions

Same exceptions as in protected mode.



64-Bit Mode Exceptions

#GP(0) If the memory address is in a non-canonical form.

If an attempt is made to load SS register with NULL segment selector when CPL = 3.

If an attempt is made to load SS register with NULL segment selector when CPL < 3 and CPL ≠ RPL.

#GP(selector) If segment selector index is outside descriptor table limits.

If the memory access to the descriptor table is non-canonical.

If the SS register is being loaded and the segment selector's RPL and the segment descriptor’s DPL are not equal to the CPL.

If the SS register is being loaded and the segment pointed to is a nonwritable data segment.

If the DS, ES, FS, or GS register is being loaded and the segment pointed to is not a data or readable code segment.

If the DS, ES, FS, or GS register is being loaded and the segment pointed to is a data or nonconforming code segment, but both the RPL and the CPL are greater than the DPL.

#SS(0) If the stack address is in a non-canonical form.

#SS(selector) If the SS register is being loaded and the segment pointed to is marked not present.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If attempt is made to load the CS register.

If the LOCK prefix is used.


MOV—Move to/from Control Registers


Opcode/ Instruction

Op/ En

64-Bit Mode

Compat/ Leg Mode

Description

0F 20/r

MOV r32, CR0–CR7

MR

N.E.

Valid

Move control register to r32.

0F 20/r

MOV r64, CR0–CR7

MR

Valid

N.E.

Move extended control register to r64.

REX.R + 0F 20 /0

MOV r64, CR8

MR

Valid

N.E.

Move extended CR8 to r64.1

0F 22 /r

MOV CR0–CR7, r32

RM

N.E.

Valid

Move r32 to control register.

0F 22 /r

MOV CR0–CR7, r64

RM

Valid

N.E.

Move r64 to extended control register.

REX.R + 0F 22 /0

MOV CR8, r64

RM

Valid

N.E.

Move r64 to extended CR8.1

NOTE:

1. MOV CR* instructions, except for MOV CR8, are serializing instructions. MOV CR8 is not architecturally defined as a serializing instruction. For more information, see Chapter 8 in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.



Instruction Operand Encoding

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

MR

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

RM

ModRM:reg (w)

ModRM:r/m (r)

NA

NA


Description

Moves the contents of a control register (CR0, CR2, CR3, CR4, or CR8) to a general-purpose register or the contents of a general purpose register to a control register. The operand size for these instructions is always 32 bits in non-64-bit modes, regardless of the operand-size attribute. (See “Control Registers” in Chapter 2 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for a detailed description of the flags and fields in the control registers.) This instruction can be executed only when the current privilege level is 0.

At the opcode level, the reg field within the ModR/M byte specifies which of the control registers is loaded or read. The 2 bits in the mod field are ignored. The r/m field specifies the general-purpose register loaded or read.

Attempts to reference CR1, CR5, CR6, CR7, and CR9–CR15 result in undefined opcode (#UD) exceptions.

When loading control registers, programs should not attempt to change the reserved bits; that is, always set reserved bits to the value previously read. An attempt to change CR4's reserved bits will cause a general protection fault. Reserved bits in CR0 and CR3 remain clear after any load of those registers; attempts to set them have no impact. On Pentium 4, Intel Xeon and P6 family processors, CR0.ET remains set after any load of CR0; attempts to clear this bit have no impact.

In certain cases, these instructions have the side effect of invalidating entries in the TLBs and the paging-structure caches. See Section 4.10.4.1, “Operations that Invalidate TLBs and Paging-Structure Caches,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A for details.

The following side effects are implementation-specific for the Pentium 4, Intel Xeon, and P6 processor family: when modifying PE or PG in register CR0, or PSE or PAE in register CR4, all TLB entries are flushed, including global entries. Software should not depend on this functionality in all Intel 64 or IA-32 processors.

In 64-bit mode, the instruction’s default operation size is 64 bits. The REX.R prefix must be used to access CR8. Use of REX.B permits access to additional registers (R8-R15). Use of the REX.W prefix or 66H prefix is ignored. Use of the REX.R prefix to specify a register other than CR8 causes an invalid-opcode exception. See the summary chart at the beginning of this section for encoding data and limits.

If CR4.PCIDE = 1, bit 63 of the source operand to MOV to CR3 determines whether the instruction invalidates entries in the TLBs and the paging-structure caches (see Section 4.10.4.1, “Operations that Invalidate TLBs and Paging-Structure Caches,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). The instruction does not modify bit 63 of CR3, which is reserved and always 0.

See “Changes to Instruction Behavior in VMX Non-Root Operation” in Chapter 25 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3C, for more information about the behavior of this instruction in VMX non-root operation.


Operation

DEST ← SRC;


Flags Affected

The OF, SF, ZF, AF, PF, and CF flags are undefined.


Protected Mode Exceptions

#GP(0) If the current privilege level is not 0.

If an attempt is made to write invalid bit combinations in CR0 (such as setting the PG flag to 1 when the PE flag is set to 0, or setting the CD flag to 0 when the NW flag is set to 1).

If an attempt is made to write a 1 to any reserved bit in CR4.

If an attempt is made to write 1 to CR4.PCIDE.

If any of the reserved bits are set in the page-directory pointers table (PDPT) and the loading of a control register causes the PDPT to be loaded into the processor.

#UD If the LOCK prefix is used.

If an attempt is made to access CR1, CR5, CR6, or CR7.


Real-Address Mode Exceptions

#GP If an attempt is made to write a 1 to any reserved bit in CR4.

If an attempt is made to write 1 to CR4.PCIDE.

If an attempt is made to write invalid bit combinations in CR0 (such as setting the PG flag to 1 when the PE flag is set to 0).

#UD If the LOCK prefix is used.

If an attempt is made to access CR1, CR5, CR6, or CR7.


Virtual-8086 Mode Exceptions

#GP(0) These instructions cannot be executed in virtual-8086 mode.


Compatibility Mode Exceptions

#GP(0) If the current privilege level is not 0.

If an attempt is made to write invalid bit combinations in CR0 (such as setting the PG flag to 1 when the PE flag is set to 0, or setting the CD flag to 0 when the NW flag is set to 1).

If an attempt is made to change CR4.PCIDE from 0 to 1 while CR3[11:0] ≠ 000H.

If an attempt is made to clear CR0.PG[bit 31] while CR4.PCIDE = 1.

If an attempt is made to write a 1 to any reserved bit in CR3.

If an attempt is made to leave IA-32e mode by clearing CR4.PAE[bit 5].

#UD If the LOCK prefix is used.

If an attempt is made to access CR1, CR5, CR6, or CR7.



64-Bit Mode Exceptions

#GP(0) If the current privilege level is not 0.

If an attempt is made to write invalid bit combinations in CR0 (such as setting the PG flag to 1 when the PE flag is set to 0, or setting the CD flag to 0 when the NW flag is set to 1).

If an attempt is made to change CR4.PCIDE from 0 to 1 while CR3[11:0] ≠ 000H.

If an attempt is made to clear CR0.PG[bit 31].

If an attempt is made to write a 1 to any reserved bit in CR4.

If an attempt is made to write a 1 to any reserved bit in CR8.

If an attempt is made to write a 1 to any reserved bit in CR3.

If an attempt is made to leave IA-32e mode by clearing CR4.PAE[bit 5].

#UD If the LOCK prefix is used.

If an attempt is made to access CR1, CR5, CR6, or CR7.

If the REX.R prefix is used to specify a register other than CR8.


MOV—Move to/from Debug Registers

Opcode/ Instruction

Op/ En

64-Bit Mode

Compat/ Leg Mode

Description

0F 21/r

MOV r32, DR0–DR7

MR

N.E.

Valid

Move debug register to r32.

0F 21/r

MOV r64, DR0–DR7

MR

Valid

N.E.

Move extended debug register to r64.

0F 23 /r

MOV DR0–DR7, r32

RM

N.E.

Valid

Move r32 to debug register.

0F 23 /r

MOV DR0–DR7, r64

RM

Valid

N.E.

Move r64 to extended debug register.


Instruction Operand Encoding

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

MR

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

RM

ModRM:reg (w)

ModRM:r/m (r)

NA

NA


Description

Moves the contents of a debug register (DR0, DR1, DR2, DR3, DR4, DR5, DR6, or DR7) to a general-purpose register or vice versa. The operand size for these instructions is always 32 bits in non-64-bit modes, regardless of the operand-size attribute. (See Section 17.2, “Debug Registers”, of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for a detailed description of the flags and fields in the debug registers.)

The instructions must be executed at privilege level 0 or in real-address mode.

When the debug extension (DE) flag in register CR4 is clear, these instructions operate on debug registers in a manner that is compatible with Intel386 and Intel486 processors. In this mode, references to DR4 and DR5 refer to DR6 and DR7, respectively. When the DE flag in CR4 is set, attempts to reference DR4 and DR5 result in an undefined opcode (#UD) exception. (The CR4 register was added to the IA-32 Architecture beginning with the Pentium processor.)

At the opcode level, the reg field within the ModR/M byte specifies which of the debug registers is loaded or read. The two bits in the mod field are ignored. The r/m field specifies the general-purpose register loaded or read.

In 64-bit mode, the instruction’s default operation size is 64 bits. Use of the REX.B prefix permits access to additional registers (R8–R15). Use of the REX.W or 66H prefix is ignored. Use of the REX.R prefix causes an invalid-opcode exception. See the summary chart at the beginning of this section for encoding data and limits.


Operation

IF ((DE = 1) and (SRC or DEST = DR4 or DR5))
    THEN #UD;
    ELSE DEST ← SRC;
FI;


Flags Affected

The OF, SF, ZF, AF, PF, and CF flags are undefined.



Protected Mode Exceptions

#GP(0) If the current privilege level is not 0.

#UD If CR4.DE[bit 3] = 1 (debug extensions) and a MOV instruction is executed involving DR4 or DR5.

If the LOCK prefix is used.

#DB If any debug register is accessed while the DR7.GD[bit 13] = 1.


Real-Address Mode Exceptions

#UD If CR4.DE[bit 3] = 1 (debug extensions) and a MOV instruction is executed involving DR4 or DR5.

If the LOCK prefix is used.

#DB If any debug register is accessed while the DR7.GD[bit 13] = 1.


Virtual-8086 Mode Exceptions

#GP(0) The debug registers cannot be loaded or read when in virtual-8086 mode.


Compatibility Mode Exceptions

Same exceptions as in protected mode.


64-Bit Mode Exceptions

#GP(0) If the current privilege level is not 0.

If an attempt is made to write a 1 to any of bits 63:32 in DR6.

If an attempt is made to write a 1 to any of bits 63:32 in DR7.

#UD If CR4.DE[bit 3] = 1 (debug extensions) and a MOV instruction is executed involving DR4 or DR5.

If the LOCK prefix is used.

If the REX.R prefix is used.

#DB If any debug register is accessed while the DR7.GD[bit 13] = 1.


MOVAPD—Move Aligned Packed Double-Precision Floating-Point Values

Opcode/ Instruction

Op/En

64/32

bit Mode Support

CPUID

Feature Flag

Description

66 0F 28 /r

MOVAPD xmm1, xmm2/m128

A

V/V

SSE2

Move aligned packed double-precision floating- point values from xmm2/mem to xmm1.

66 0F 29 /r

MOVAPD xmm2/m128, xmm1

B

V/V

SSE2

Move aligned packed double-precision floating- point values from xmm1 to xmm2/mem.

VEX.128.66.0F.WIG 28 /r

VMOVAPD xmm1, xmm2/m128

A

V/V

AVX

Move aligned packed double-precision floating- point values from xmm2/mem to xmm1.

VEX.128.66.0F.WIG 29 /r

VMOVAPD xmm2/m128, xmm1

B

V/V

AVX

Move aligned packed double-precision floating- point values from xmm1 to xmm2/mem.

VEX.256.66.0F.WIG 28 /r

VMOVAPD ymm1, ymm2/m256

A

V/V

AVX

Move aligned packed double-precision floating- point values from ymm2/mem to ymm1.

VEX.256.66.0F.WIG 29 /r

VMOVAPD ymm2/m256, ymm1

B

V/V

AVX

Move aligned packed double-precision floating- point values from ymm1 to ymm2/mem.

EVEX.128.66.0F.W1 28 /r

VMOVAPD xmm1 {k1}{z}, xmm2/m128

C

V/V

AVX512VL AVX512F

Move aligned packed double-precision floating- point values from xmm2/m128 to xmm1 using writemask k1.

EVEX.256.66.0F.W1 28 /r

VMOVAPD ymm1 {k1}{z}, ymm2/m256

C

V/V

AVX512VL AVX512F

Move aligned packed double-precision floating- point values from ymm2/m256 to ymm1 using writemask k1.

EVEX.512.66.0F.W1 28 /r

VMOVAPD zmm1 {k1}{z}, zmm2/m512

C

V/V

AVX512F

Move aligned packed double-precision floating- point values from zmm2/m512 to zmm1 using writemask k1.

EVEX.128.66.0F.W1 29 /r

VMOVAPD xmm2/m128 {k1}{z}, xmm1

D

V/V

AVX512VL AVX512F

Move aligned packed double-precision floating- point values from xmm1 to xmm2/m128 using writemask k1.

EVEX.256.66.0F.W1 29 /r

VMOVAPD ymm2/m256 {k1}{z}, ymm1

D

V/V

AVX512VL AVX512F

Move aligned packed double-precision floating- point values from ymm1 to ymm2/m256 using writemask k1.

EVEX.512.66.0F.W1 29 /r

VMOVAPD zmm2/m512 {k1}{z}, zmm1

D

V/V

AVX512F

Move aligned packed double-precision floating- point values from zmm1 to zmm2/m512 using writemask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

B

NA

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

C

Full Mem

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

D

Full Mem

ModRM:r/m (w)

ModRM:reg (r)

NA

NA



Description

Moves 2, 4 or 8 double-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM, YMM or ZMM register from a 128-bit, 256-bit or 512-bit memory location, to store the contents of an XMM, YMM or ZMM register into a 128-bit, 256-bit or 512-bit memory location, or to move data between two XMM, two YMM or two ZMM registers.

When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte (128-bit versions), 32-byte (256-bit version) or 64-byte (EVEX.512 encoded version) boundary or a general-protection exception (#GP) will be generated. For EVEX encoded versions, the operand must be aligned to the size of the memory operand. To move double-precision floating-point values to and from unaligned memory locations, use the VMOVUPD instruction.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

EVEX.512 encoded version:

Moves 512 bits of packed double-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a ZMM register from a 512-bit float64 memory location, to store the contents of a ZMM register into a 512-bit float64 memory location, or to move data between two ZMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 64-byte boundary or a general-protection exception (#GP) will be generated. To move double-precision floating-point values to and from unaligned memory locations, use the VMOVUPD instruction.

VEX.256 and EVEX.256 encoded versions:

Moves 256 bits of packed double-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte boundary or a general-protection exception (#GP) will be generated. To move double-precision floating-point values to and from unaligned memory locations, use the VMOVUPD instruction.

128-bit versions:

Moves 128 bits of packed double-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated. To move double-precision floating-point values to and from unaligned memory locations, use the VMOVUPD instruction.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding ZMM destination register remain unchanged.

(E)VEX.128 encoded version: Bits (MAXVL-1:128) of the destination ZMM register are zeroed.


Operation

VMOVAPD (EVEX encoded versions, register-copy form)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VMOVAPD (EVEX encoded versions, store-form)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE *DEST[i+63:i] remains unchanged* ; merging-masking
    FI;
ENDFOR;


VMOVAPD (EVEX encoded versions, load-form)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VMOVAPD (VEX.256 encoded version, load - and register copy)

DEST[255:0] ← SRC[255:0]
DEST[MAXVL-1:256] ← 0


VMOVAPD (VEX.256 encoded version, store-form)

DEST[255:0] ← SRC[255:0]


VMOVAPD (VEX.128 encoded version, load - and register copy)

DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] ← 0


MOVAPD (128-bit load- and register-copy- form Legacy SSE version)

DEST[127:0] ← SRC[127:0]

DEST[MAXVL-1:128] (Unmodified)


(V)MOVAPD (128-bit store-form version)

DEST[127:0] ← SRC[127:0]



Intel C/C++ Compiler Intrinsic Equivalent

VMOVAPD __m512d _mm512_load_pd( void * m);
VMOVAPD __m512d _mm512_mask_load_pd( __m512d s, __mmask8 k, void * m);
VMOVAPD __m512d _mm512_maskz_load_pd( __mmask8 k, void * m);
VMOVAPD void _mm512_store_pd( void * d, __m512d a);
VMOVAPD void _mm512_mask_store_pd( void * d, __mmask8 k, __m512d a);
VMOVAPD __m256d _mm256_mask_load_pd( __m256d s, __mmask8 k, void * m);
VMOVAPD __m256d _mm256_maskz_load_pd( __mmask8 k, void * m);
VMOVAPD void _mm256_mask_store_pd( void * d, __mmask8 k, __m256d a);
VMOVAPD __m128d _mm_mask_load_pd( __m128d s, __mmask8 k, void * m);
VMOVAPD __m128d _mm_maskz_load_pd( __mmask8 k, void * m);
VMOVAPD void _mm_mask_store_pd( void * d, __mmask8 k, __m128d a);
MOVAPD __m256d _mm256_load_pd (double * p);
MOVAPD void _mm256_store_pd(double * p, __m256d a);
MOVAPD __m128d _mm_load_pd (double * p);
MOVAPD void _mm_store_pd(double * p, __m128d a);
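A minimal sketch of the 128-bit aligned forms listed above, using the SSE2 intrinsics and C11 `alignas` to satisfy the 16-byte alignment requirement (the function name is illustrative):

```c
#include <emmintrin.h>   /* SSE2: _mm_load_pd / _mm_store_pd (MOVAPD) */
#include <stdalign.h>    /* C11 alignas */

/* Round-trips two doubles through an XMM register with the aligned
 * MOVAPD forms. Both buffers are 16-byte aligned; passing a pointer
 * that is not 16-byte aligned to _mm_load_pd or _mm_store_pd would
 * raise #GP at run time, which is why unaligned data should use
 * _mm_loadu_pd/_mm_storeu_pd (MOVUPD) instead. */
double sum_pair_aligned(void)
{
    alignas(16) double src[2] = { 1.5, 2.5 };
    alignas(16) double dst[2];
    __m128d v = _mm_load_pd(src);   /* MOVAPD xmm, m128 */
    _mm_store_pd(dst, v);           /* MOVAPD m128, xmm */
    return dst[0] + dst[1];         /* 1.5 + 2.5 = 4.0 */
}
```

The 256-bit and 512-bit forms follow the same pattern with 32-byte and 64-byte alignment, respectively.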


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 1.SSE2; EVEX-encoded instruction, see Exceptions Type E1.

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.


MOVAPS—Move Aligned Packed Single-Precision Floating-Point Values

Opcode/ Instruction

Op/En

64/32

bit Mode Support

CPUID

Feature Flag

Description

NP 0F 28 /r

MOVAPS xmm1, xmm2/m128

A

V/V

SSE

Move aligned packed single-precision floating-point values from xmm2/mem to xmm1.

NP 0F 29 /r

MOVAPS xmm2/m128, xmm1

B

V/V

SSE

Move aligned packed single-precision floating-point values from xmm1 to xmm2/mem.

VEX.128.0F.WIG 28 /r

VMOVAPS xmm1, xmm2/m128

A

V/V

AVX

Move aligned packed single-precision floating-point values from xmm2/mem to xmm1.

VEX.128.0F.WIG 29 /r

VMOVAPS xmm2/m128, xmm1

B

V/V

AVX

Move aligned packed single-precision floating-point values from xmm1 to xmm2/mem.

VEX.256.0F.WIG 28 /r

VMOVAPS ymm1, ymm2/m256

A

V/V

AVX

Move aligned packed single-precision floating-point values from ymm2/mem to ymm1.

VEX.256.0F.WIG 29 /r

VMOVAPS ymm2/m256, ymm1

B

V/V

AVX

Move aligned packed single-precision floating-point values from ymm1 to ymm2/mem.

EVEX.128.0F.W0 28 /r

VMOVAPS xmm1 {k1}{z}, xmm2/m128

C

V/V

AVX512VL AVX512F

Move aligned packed single-precision floating-point values from xmm2/m128 to xmm1 using writemask k1.

EVEX.256.0F.W0 28 /r

VMOVAPS ymm1 {k1}{z}, ymm2/m256

C

V/V

AVX512VL AVX512F

Move aligned packed single-precision floating-point values from ymm2/m256 to ymm1 using writemask k1.

EVEX.512.0F.W0 28 /r

VMOVAPS zmm1 {k1}{z}, zmm2/m512

C

V/V

AVX512F

Move aligned packed single-precision floating-point values from zmm2/m512 to zmm1 using writemask k1.

EVEX.128.0F.W0 29 /r

VMOVAPS xmm2/m128 {k1}{z}, xmm1

D

V/V

AVX512VL AVX512F

Move aligned packed single-precision floating-point values from xmm1 to xmm2/m128 using writemask k1.

EVEX.256.0F.W0 29 /r

VMOVAPS ymm2/m256 {k1}{z}, ymm1

D

V/V

AVX512VL AVX512F

Move aligned packed single-precision floating-point values from ymm1 to ymm2/m256 using writemask k1.

EVEX.512.0F.W0 29 /r

VMOVAPS zmm2/m512 {k1}{z}, zmm1

D

V/V

AVX512F

Move aligned packed single-precision floating-point values from zmm1 to zmm2/m512 using writemask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

B

NA

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

C

Full Mem

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

D

Full Mem

ModRM:r/m (w)

ModRM:reg (r)

NA

NA


Description

Moves 4, 8 or 16 single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM, YMM or ZMM register from a 128-bit, 256-bit or 512-bit memory location, to store the contents of an XMM, YMM or ZMM register into a 128-bit, 256-bit or 512-bit memory location, or to move data between two XMM, two YMM or two ZMM registers.

When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte (EVEX.512 encoded version) boundary or a general-protection exception (#GP) will be generated. For EVEX encoded versions, the operand must be aligned to the size of the memory operand. To move single-precision floating-point values to and from unaligned memory locations, use the VMOVUPS instruction.



Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

EVEX.512 encoded version:

Moves 512 bits of packed single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a ZMM register from a 512-bit float32 memory location, to store the contents of a ZMM register into a float32 memory location, or to move data between two ZMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 64-byte boundary or a general-protection exception (#GP) will be generated. To move single-precision floating-point values to and from unaligned memory locations, use the VMOVUPS instruction.

VEX.256 and EVEX.256 encoded version:

Moves 256 bits of packed single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte boundary or a general-protection exception (#GP) will be generated.

128-bit versions:

Moves 128 bits of packed single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated. To move single-precision floating-point values to and from unaligned memory locations, use the VMOVUPS instruction.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding ZMM destination register remain unchanged.

(E)VEX.128 encoded version: Bits (MAXVL-1:128) of the destination ZMM register are zeroed.


Operation

VMOVAPS (EVEX encoded versions, register-copy form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVAPS (EVEX encoded versions, store-form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
    FI;
ENDFOR;

VMOVAPS (EVEX encoded versions, load-form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVAPS (VEX.256 encoded version, load- and register-copy form)
DEST[255:0] ← SRC[255:0]
DEST[MAXVL-1:256] ← 0

VMOVAPS (VEX.256 encoded version, store-form)
DEST[255:0] ← SRC[255:0]

VMOVAPS (VEX.128 encoded version, load- and register-copy form)
DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] ← 0

MOVAPS (128-bit load- and register-copy form, Legacy SSE version)
DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] (Unmodified)

(V)MOVAPS (128-bit store-form version)
DEST[127:0] ← SRC[127:0]


Intel C/C++ Compiler Intrinsic Equivalent

VMOVAPS __m512 _mm512_load_ps( void * m);
VMOVAPS __m512 _mm512_mask_load_ps( __m512 s, __mmask16 k, void * m);
VMOVAPS __m512 _mm512_maskz_load_ps( __mmask16 k, void * m);
VMOVAPS void _mm512_store_ps( void * d, __m512 a);
VMOVAPS void _mm512_mask_store_ps( void * d, __mmask16 k, __m512 a);
VMOVAPS __m256 _mm256_mask_load_ps( __m256 a, __mmask8 k, void * s);
VMOVAPS __m256 _mm256_maskz_load_ps( __mmask8 k, void * s);
VMOVAPS void _mm256_mask_store_ps( void * d, __mmask8 k, __m256 a);
VMOVAPS __m128 _mm_mask_load_ps( __m128 a, __mmask8 k, void * s);
VMOVAPS __m128 _mm_maskz_load_ps( __mmask8 k, void * s);
VMOVAPS void _mm_mask_store_ps( void * d, __mmask8 k, __m128 a);
MOVAPS __m256 _mm256_load_ps (float * p);
MOVAPS void _mm256_store_ps(float * p, __m256 a);
MOVAPS __m128 _mm_load_ps (float * p);
MOVAPS void _mm_store_ps(float * p, __m128 a);


SIMD Floating-Point Exceptions

None



Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 1.SSE; additionally

#UD If VEX.vvvv != 1111B.

EVEX-encoded instruction, see Exceptions Type E1.


MOVBE—Move Data After Swapping Bytes

Opcode
Instruction
Op/En
64-Bit Mode
Compat/Leg Mode
Description

0F 38 F0 /r
MOVBE r16, m16
RM
Valid
Valid
Reverse byte order in m16 and move to r16.

0F 38 F0 /r
MOVBE r32, m32
RM
Valid
Valid
Reverse byte order in m32 and move to r32.

REX.W + 0F 38 F0 /r
MOVBE r64, m64
RM
Valid
N.E.
Reverse byte order in m64 and move to r64.

0F 38 F1 /r
MOVBE m16, r16
MR
Valid
Valid
Reverse byte order in r16 and move to m16.

0F 38 F1 /r
MOVBE m32, r32
MR
Valid
Valid
Reverse byte order in r32 and move to m32.

REX.W + 0F 38 F1 /r
MOVBE m64, r64
MR
Valid
N.E.
Reverse byte order in r64 and move to m64.


Instruction Operand Encoding

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

RM

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

MR

ModRM:r/m (w)

ModRM:reg (r)

NA

NA


Description

Performs a byte swap operation on the data copied from the second operand (source operand) and stores the result in the first operand (destination operand). The source operand can be a general-purpose register or a memory location; the destination operand can be a general-purpose register or a memory location; however, both operands cannot be registers, and only one operand can be a memory location. Both operands must be the same size, which can be a word, a doubleword, or a quadword.

The MOVBE instruction is provided for swapping the bytes on a read from memory or on a write to memory; thus providing support for converting little-endian values to big-endian format and vice versa.

In 64-bit mode, the instruction's default operation size is 32 bits. Use of the REX.R prefix permits access to additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.


Operation

TEMP ← SRC
IF (OperandSize = 16) THEN
    DEST[7:0] ← TEMP[15:8];
    DEST[15:8] ← TEMP[7:0];
ELSE IF (OperandSize = 32)
    DEST[7:0] ← TEMP[31:24];
    DEST[15:8] ← TEMP[23:16];
    DEST[23:16] ← TEMP[15:8];
    DEST[31:24] ← TEMP[7:0];
ELSE IF (OperandSize = 64)
    DEST[7:0] ← TEMP[63:56];
    DEST[15:8] ← TEMP[55:48];
    DEST[23:16] ← TEMP[47:40];
    DEST[31:24] ← TEMP[39:32];
    DEST[39:32] ← TEMP[31:24];
    DEST[47:40] ← TEMP[23:16];
    DEST[55:48] ← TEMP[15:8];
    DEST[63:56] ← TEMP[7:0];
FI;



Flags Affected

None


Protected Mode Exceptions

#GP(0) If the destination operand is in a non-writable segment.

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. If the DS, ES, FS, or GS register contains a NULL segment selector.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If CPUID.01H:ECX.MOVBE[bit 22] = 0.

If the LOCK prefix is used. If REP (F3H) prefix is used.


Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If CPUID.01H:ECX.MOVBE[bit 22] = 0.

If the LOCK prefix is used. If REP (F3H) prefix is used.


Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If CPUID.01H:ECX.MOVBE[bit 22] = 0.

If the LOCK prefix is used. If REP (F3H) prefix is used.

If REPNE (F2H) prefix is used and CPUID.01H:ECX.SSE4_2[bit 20] = 0.


Compatibility Mode Exceptions

Same exceptions as in protected mode.


64-Bit Mode Exceptions

#GP(0) If the memory address is in a non-canonical form.

#SS(0) If the stack address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If CPUID.01H:ECX.MOVBE[bit 22] = 0.

If the LOCK prefix is used. If REP (F3H) prefix is used.


MOVD/MOVQ—Move Doubleword/Move Quadword

Opcode/ Instruction

Op/ En

64/32-bit Mode

CPUID Feature Flag

Description

NP 0F 6E /r

MOVD mm, r/m32

A

V/V

MMX

Move doubleword from r/m32 to mm.

NP REX.W + 0F 6E /r

MOVQ mm, r/m64

A

V/N.E.

MMX

Move quadword from r/m64 to mm.

NP 0F 7E /r

MOVD r/m32, mm

B

V/V

MMX

Move doubleword from mm to r/m32.

NP REX.W + 0F 7E /r

MOVQ r/m64, mm

B

V/N.E.

MMX

Move quadword from mm to r/m64.

66 0F 6E /r

MOVD xmm, r/m32

A

V/V

SSE2

Move doubleword from r/m32 to xmm.

66 REX.W 0F 6E /r

MOVQ xmm, r/m64

A

V/N.E.

SSE2

Move quadword from r/m64 to xmm.

66 0F 7E /r

MOVD r/m32, xmm

B

V/V

SSE2

Move doubleword from xmm register to r/m32.

66 REX.W 0F 7E /r

MOVQ r/m64, xmm

B

V/N.E.

SSE2

Move quadword from xmm register to r/m64.

VEX.128.66.0F.W0 6E /r

VMOVD xmm1, r32/m32

A

V/V

AVX

Move doubleword from r/m32 to xmm1.

VEX.128.66.0F.W1 6E /r

VMOVQ xmm1, r64/m64

A

V/N.E.1

AVX

Move quadword from r/m64 to xmm1.

VEX.128.66.0F.W0 7E /r

VMOVD r32/m32, xmm1

B

V/V

AVX

Move doubleword from xmm1 register to r/m32.

VEX.128.66.0F.W1 7E /r

VMOVQ r64/m64, xmm1

B

V/N.E.1

AVX

Move quadword from xmm1 register to r/m64.

EVEX.128.66.0F.W0 6E /r

VMOVD xmm1, r32/m32

C

V/V

AVX512F

Move doubleword from r/m32 to xmm1.

EVEX.128.66.0F.W1 6E /r

VMOVQ xmm1, r64/m64

C

V/N.E.1

AVX512F

Move quadword from r/m64 to xmm1.

EVEX.128.66.0F.W0 7E /r

VMOVD r32/m32, xmm1

D

V/V

AVX512F

Move doubleword from xmm1 register to r/m32.

EVEX.128.66.0F.W1 7E /r

VMOVQ r64/m64, xmm1

D

V/N.E.1

AVX512F

Move quadword from xmm1 register to r/m64.

NOTES:

1. For this specific instruction, VEX.W/EVEX.W in non-64-bit mode is ignored; the instruction behaves as if the W0 version were used.



Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

B

NA

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

C

Tuple1 Scalar

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

D

Tuple1 Scalar

ModRM:r/m (w)

ModRM:reg (r)

NA

NA


Description

Copies a doubleword from the source operand (second operand) to the destination operand (first operand). The source and destination operands can be general-purpose registers, MMX technology registers, XMM registers, or 32-bit memory locations. This instruction can be used to move a doubleword to and from the low doubleword of an MMX technology register and a general-purpose register or a 32-bit memory location, or to and from the low doubleword of an XMM register and a general-purpose register or a 32-bit memory location. The instruction cannot be used to transfer data between MMX technology registers, between XMM registers, between general-purpose registers, or between memory locations.

When the destination operand is an MMX technology register, the source operand is written to the low doubleword of the register, and the register is zero-extended to 64 bits. When the destination operand is an XMM register, the source operand is written to the low doubleword of the register, and the register is zero-extended to 128 bits.

In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.R prefix permits access to additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.

MOVD/Q with XMM destination:

Moves a dword/qword integer from the source operand and stores it in the low 32/64-bits of the destination XMM register. The upper bits of the destination are zeroed. The source operand can be a 32/64-bit register or 32/64-bit memory location.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding YMM destination register remain unchanged. Qword operation requires the use of REX.W=1.

VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed. Qword operation requires the use of VEX.W=1.

EVEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed. Qword operation requires the use of EVEX.W=1.


MOVD/Q with 32/64 reg/mem destination:

Stores the low dword/qword of the source XMM register to 32/64-bit memory location or general-purpose register. Qword operation requires the use of REX.W=1, VEX.W=1, or EVEX.W=1.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.

Attempting to execute VMOVD or VMOVQ encoded with VEX.L = 1 will cause an #UD exception.


Operation

MOVD (when destination operand is MMX technology register)
DEST[31:0] ← SRC;
DEST[63:32] ← 00000000H;

MOVD (when destination operand is XMM register)
DEST[31:0] ← SRC;
DEST[127:32] ← 000000000000000000000000H;
DEST[MAXVL-1:128] (Unmodified)

MOVD (when source operand is MMX technology or XMM register)
DEST ← SRC[31:0];

VMOVD (VEX-encoded version when destination is an XMM register)
DEST[31:0] ← SRC[31:0]
DEST[MAXVL-1:32] ← 0

MOVQ (when destination operand is XMM register)
DEST[63:0] ← SRC[63:0];
DEST[127:64] ← 0000000000000000H;
DEST[MAXVL-1:128] (Unmodified)

MOVQ (when destination operand is r/m64)
DEST[63:0] ← SRC[63:0];

MOVQ (when source operand is XMM register or r/m64)
DEST ← SRC[63:0];

VMOVQ (VEX-encoded version when destination is an XMM register)
DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] ← 0

VMOVD (EVEX-encoded version when destination is an XMM register)
DEST[31:0] ← SRC[31:0]
DEST[MAXVL-1:32] ← 0

VMOVQ (EVEX-encoded version when destination is an XMM register)
DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] ← 0


Intel C/C Compiler Intrinsic Equivalent

MOVD: m64 _mm_cvtsi32_si64 (int i )

MOVD: int _mm_cvtsi64_si32 ( m64m )

MOVD: m128i _mm_cvtsi32_si128 (int a)

MOVD: int _mm_cvtsi128_si32 ( m128i a)

MOVQ: MOVQ: VMOVD

int64 _mm_cvtsi128_si64( m128i);

m128i _mm_cvtsi64_si128( int64);

m128i _mm_cvtsi32_si128( int);

VMOVD int _mm_cvtsi128_si32( m128i );

VMOVQ VMOVQ VMOVQ

m128i _mm_cvtsi64_si128 ( int64);

int64 _mm_cvtsi128_si64( m128i );

m128i _mm_loadl_epi64( m128i * s);

VMOVQ void _mm_storel_epi64( m128i * d, m128i s);


Flags Affected

None


SIMD Floating-Point Exceptions

None



Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5. EVEX-encoded instruction, see Exceptions Type E9NF.

#UD If VEX.L = 1.

If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.


MOVDDUP—Replicate Double FP Values

Opcode/ Instruction

Op / En

64/32 bit Mode Support

CPUID Feature Flag

Description

F2 0F 12 /r

MOVDDUP xmm1, xmm2/m64

A

V/V

SSE3

Move double-precision floating-point value from xmm2/m64 and duplicate into xmm1.

VEX.128.F2.0F.WIG 12 /r

VMOVDDUP xmm1, xmm2/m64

A

V/V

AVX

Move double-precision floating-point value from xmm2/m64 and duplicate into xmm1.

VEX.256.F2.0F.WIG 12 /r

VMOVDDUP ymm1, ymm2/m256

A

V/V

AVX

Move even index double-precision floating-point values from ymm2/mem and duplicate each element into ymm1.

EVEX.128.F2.0F.W1 12 /r
VMOVDDUP xmm1 {k1}{z}, xmm2/m64

B

V/V

AVX512VL AVX512F

Move double-precision floating-point value from xmm2/m64 and duplicate each element into xmm1 subject to writemask k1.

EVEX.256.F2.0F.W1 12 /r
VMOVDDUP ymm1 {k1}{z}, ymm2/m256

B

V/V

AVX512VL AVX512F

Move even index double-precision floating-point values from ymm2/m256 and duplicate each element into ymm1 subject to writemask k1.

EVEX.512.F2.0F.W1 12 /r
VMOVDDUP zmm1 {k1}{z}, zmm2/m512

B

V/V

AVX512F

Move even index double-precision floating-point values from zmm2/m512 and duplicate each element into zmm1 subject to writemask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

B

MOVDDUP

ModRM:reg (w)

ModRM:r/m (r)

NA

NA


Description

For 256-bit or higher versions: Duplicates even-indexed double-precision floating-point values from the source operand (the second operand) into adjacent pairs and stores them to the destination operand (the first operand).

For 128-bit versions: Duplicates the low double-precision floating-point value from the source operand (the second operand) and stores it to the destination operand (the first operand).

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register are unchanged. The source operand is XMM register or a 64-bit memory location.

VEX.128 and EVEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed. The source operand is XMM register or a 64-bit memory location. The destination is updated conditionally under the writemask for EVEX version.

VEX.256 and EVEX.256 encoded version: Bits (MAXVL-1:256) of the destination register are zeroed. The source operand is YMM register or a 256-bit memory location. The destination is updated conditionally under the writemask for EVEX version.

EVEX.512 encoded version: The destination is updated according to the writemask. The source operand is ZMM register or a 512-bit memory location.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.


Figure 4-2. VMOVDDUP Operation (source elements {X3, X2, X1, X0} produce destination elements {X2, X2, X0, X0})


Operation

VMOVDDUP (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
TMP_SRC[63:0] ← SRC[63:0]
TMP_SRC[127:64] ← SRC[63:0]
IF VL >= 256
    TMP_SRC[191:128] ← SRC[191:128]
    TMP_SRC[255:192] ← SRC[191:128]
FI;
IF VL >= 512
    TMP_SRC[319:256] ← SRC[319:256]
    TMP_SRC[383:320] ← SRC[319:256]
    TMP_SRC[447:384] ← SRC[447:384]
    TMP_SRC[511:448] ← SRC[447:384]
FI;

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_SRC[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVDDUP (VEX.256 encoded version)
DEST[63:0] ← SRC[63:0]
DEST[127:64] ← SRC[63:0]
DEST[191:128] ← SRC[191:128]
DEST[255:192] ← SRC[191:128]
DEST[MAXVL-1:256] ← 0

VMOVDDUP (VEX.128 encoded version)
DEST[63:0] ← SRC[63:0]
DEST[127:64] ← SRC[63:0]
DEST[MAXVL-1:128] ← 0

MOVDDUP (128-bit Legacy SSE version)
DEST[63:0] ← SRC[63:0]
DEST[127:64] ← SRC[63:0]
DEST[MAXVL-1:128] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VMOVDDUP __m512d _mm512_movedup_pd( __m512d a);
VMOVDDUP __m512d _mm512_mask_movedup_pd( __m512d s, __mmask8 k, __m512d a);
VMOVDDUP __m512d _mm512_maskz_movedup_pd( __mmask8 k, __m512d a);
VMOVDDUP __m256d _mm256_mask_movedup_pd( __m256d s, __mmask8 k, __m256d a);
VMOVDDUP __m256d _mm256_maskz_movedup_pd( __mmask8 k, __m256d a);
VMOVDDUP __m128d _mm_mask_movedup_pd( __m128d s, __mmask8 k, __m128d a);
VMOVDDUP __m128d _mm_maskz_movedup_pd( __mmask8 k, __m128d a);
MOVDDUP __m256d _mm256_movedup_pd ( __m256d a);
MOVDDUP __m128d _mm_movedup_pd ( __m128d a);


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5; EVEX-encoded instruction, see Exceptions Type E5NF.

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.


MOVDQA,VMOVDQA32/64—Move Aligned Packed Integer Values

Opcode/ Instruction

Op/En

64/32 bit Mode Support

CPUID Feature Flag

Description

66 0F 6F /r

MOVDQA xmm1, xmm2/m128

A

V/V

SSE2

Move aligned packed integer values from xmm2/mem to xmm1.

66 0F 7F /r

MOVDQA xmm2/m128, xmm1

B

V/V

SSE2

Move aligned packed integer values from xmm1 to xmm2/mem.

VEX.128.66.0F.WIG 6F /r

VMOVDQA xmm1, xmm2/m128

A

V/V

AVX

Move aligned packed integer values from xmm2/mem to xmm1.

VEX.128.66.0F.WIG 7F /r

VMOVDQA xmm2/m128, xmm1

B

V/V

AVX

Move aligned packed integer values from xmm1 to xmm2/mem.

VEX.256.66.0F.WIG 6F /r

VMOVDQA ymm1, ymm2/m256

A

V/V

AVX

Move aligned packed integer values from ymm2/mem to ymm1.

VEX.256.66.0F.WIG 7F /r

VMOVDQA ymm2/m256, ymm1

B

V/V

AVX

Move aligned packed integer values from ymm1 to ymm2/mem.

EVEX.128.66.0F.W0 6F /r
VMOVDQA32 xmm1 {k1}{z}, xmm2/m128

C

V/V

AVX512VL AVX512F

Move aligned packed doubleword integer values from xmm2/m128 to xmm1 using writemask k1.

EVEX.256.66.0F.W0 6F /r
VMOVDQA32 ymm1 {k1}{z}, ymm2/m256

C

V/V

AVX512VL AVX512F

Move aligned packed doubleword integer values from ymm2/m256 to ymm1 using writemask k1.

EVEX.512.66.0F.W0 6F /r
VMOVDQA32 zmm1 {k1}{z}, zmm2/m512

C

V/V

AVX512F

Move aligned packed doubleword integer values from zmm2/m512 to zmm1 using writemask k1.

EVEX.128.66.0F.W0 7F /r
VMOVDQA32 xmm2/m128 {k1}{z}, xmm1

D

V/V

AVX512VL AVX512F

Move aligned packed doubleword integer values from xmm1 to xmm2/m128 using writemask k1.

EVEX.256.66.0F.W0 7F /r
VMOVDQA32 ymm2/m256 {k1}{z}, ymm1

D

V/V

AVX512VL AVX512F

Move aligned packed doubleword integer values from ymm1 to ymm2/m256 using writemask k1.

EVEX.512.66.0F.W0 7F /r
VMOVDQA32 zmm2/m512 {k1}{z}, zmm1

D

V/V

AVX512F

Move aligned packed doubleword integer values from zmm1 to zmm2/m512 using writemask k1.

EVEX.128.66.0F.W1 6F /r
VMOVDQA64 xmm1 {k1}{z}, xmm2/m128

C

V/V

AVX512VL AVX512F

Move aligned quadword integer values from xmm2/m128 to xmm1 using writemask k1.

EVEX.256.66.0F.W1 6F /r
VMOVDQA64 ymm1 {k1}{z}, ymm2/m256

C

V/V

AVX512VL AVX512F

Move aligned quadword integer values from ymm2/m256 to ymm1 using writemask k1.

EVEX.512.66.0F.W1 6F /r
VMOVDQA64 zmm1 {k1}{z}, zmm2/m512

C

V/V

AVX512F

Move aligned packed quadword integer values from zmm2/m512 to zmm1 using writemask k1.

EVEX.128.66.0F.W1 7F /r
VMOVDQA64 xmm2/m128 {k1}{z}, xmm1

D

V/V

AVX512VL AVX512F

Move aligned packed quadword integer values from xmm1 to xmm2/m128 using writemask k1.

EVEX.256.66.0F.W1 7F /r
VMOVDQA64 ymm2/m256 {k1}{z}, ymm1

D

V/V

AVX512VL AVX512F

Move aligned packed quadword integer values from ymm1 to ymm2/m256 using writemask k1.

EVEX.512.66.0F.W1 7F /r
VMOVDQA64 zmm2/m512 {k1}{z}, zmm1

D

V/V

AVX512F

Move aligned packed quadword integer values from zmm1 to zmm2/m512 using writemask k1.



Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

B

NA

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

C

Full Mem

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

D

Full Mem

ModRM:r/m (w)

ModRM:reg (r)

NA

NA


Description

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

EVEX encoded versions:

Moves 128, 256 or 512 bits of packed doubleword/quadword integer values from the source operand (the second operand) to the destination operand (the first operand). This instruction can be used to load a vector register from an int32/int64 memory location, to store the contents of a vector register into an int32/int64 memory location, or to move data between two ZMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 16 (EVEX.128)/32(EVEX.256)/64(EVEX.512)-byte boundary or a general-protection exception (#GP) will be generated. To move integer data to and from unaligned memory locations, use the VMOVDQU instruction.

The destination operand is updated at 32-bit (VMOVDQA32) or 64-bit (VMOVDQA64) granularity according to the writemask.

VEX.256 encoded version:

Moves 256 bits of packed integer values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers.

When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte boundary or a general-protection exception (#GP) will be generated. To move integer data to and from unaligned memory locations, use the VMOVDQU instruction. Bits (MAXVL-1:256) of the destination register are zeroed.

128-bit versions:

Moves 128 bits of packed integer values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.

When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated. To move integer data to and from unaligned memory locations, use the VMOVDQU instruction.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding ZMM destination register remain unchanged.

VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed.



Operation

VMOVDQA32 (EVEX encoded versions, register-copy form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVDQA32 (EVEX encoded versions, store-form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
    FI;
ENDFOR;

VMOVDQA32 (EVEX encoded versions, load-form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VMOVDQA64 (EVEX encoded versions, register-copy form)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0

VMOVDQA64 (EVEX encoded versions, store-form)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE *DEST[i+63:i] remains unchanged* ; merging-masking
    FI;
ENDFOR;

VMOVDQA64 (EVEX encoded versions, load-form)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VMOVDQA (VEX.256 encoded version, load- and register-copy form)
DEST[255:0] ← SRC[255:0]
DEST[MAXVL-1:256] ← 0

VMOVDQA (VEX.256 encoded version, store-form)
DEST[255:0] ← SRC[255:0]

VMOVDQA (VEX.128 encoded version)
DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] ← 0

MOVDQA (128-bit load- and register-copy form, Legacy SSE version)
DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] (Unmodified)

(V)MOVDQA (128-bit store-form version)
DEST[127:0] ← SRC[127:0]


Intel C/C++ Compiler Intrinsic Equivalent

VMOVDQA32 __m512i _mm512_load_epi32( void * sa);
VMOVDQA32 __m512i _mm512_mask_load_epi32( __m512i s, __mmask16 k, void * sa);
VMOVDQA32 __m512i _mm512_maskz_load_epi32( __mmask16 k, void * sa);
VMOVDQA32 void _mm512_store_epi32(void * d, __m512i a);
VMOVDQA32 void _mm512_mask_store_epi32(void * d, __mmask16 k, __m512i a);
VMOVDQA32 __m256i _mm256_mask_load_epi32( __m256i s, __mmask8 k, void * sa);
VMOVDQA32 __m256i _mm256_maskz_load_epi32( __mmask8 k, void * sa);
VMOVDQA32 void _mm256_store_epi32(void * d, __m256i a);
VMOVDQA32 void _mm256_mask_store_epi32(void * d, __mmask8 k, __m256i a);
VMOVDQA32 __m128i _mm_mask_load_epi32( __m128i s, __mmask8 k, void * sa);
VMOVDQA32 __m128i _mm_maskz_load_epi32( __mmask8 k, void * sa);
VMOVDQA32 void _mm_store_epi32(void * d, __m128i a);
VMOVDQA32 void _mm_mask_store_epi32(void * d, __mmask8 k, __m128i a);
VMOVDQA64 __m512i _mm512_load_epi64( void * sa);
VMOVDQA64 __m512i _mm512_mask_load_epi64( __m512i s, __mmask8 k, void * sa);
VMOVDQA64 __m512i _mm512_maskz_load_epi64( __mmask8 k, void * sa);
VMOVDQA64 void _mm512_store_epi64(void * d, __m512i a);
VMOVDQA64 void _mm512_mask_store_epi64(void * d, __mmask8 k, __m512i a);
VMOVDQA64 __m256i _mm256_mask_load_epi64( __m256i s, __mmask8 k, void * sa);
VMOVDQA64 __m256i _mm256_maskz_load_epi64( __mmask8 k, void * sa);
VMOVDQA64 void _mm256_store_epi64(void * d, __m256i a);
VMOVDQA64 void _mm256_mask_store_epi64(void * d, __mmask8 k, __m256i a);
VMOVDQA64 __m128i _mm_mask_load_epi64( __m128i s, __mmask8 k, void * sa);
VMOVDQA64 __m128i _mm_maskz_load_epi64( __mmask8 k, void * sa);
VMOVDQA64 void _mm_store_epi64(void * d, __m128i a);
VMOVDQA64 void _mm_mask_store_epi64(void * d, __mmask8 k, __m128i a);
MOVDQA __m256i _mm256_load_si256 (__m256i * p);
MOVDQA void _mm256_store_si256(__m256i *p, __m256i a);
MOVDQA __m128i _mm_load_si128 (__m128i * p);
MOVDQA void _mm_store_si128(__m128i *p, __m128i a);


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 1.SSE2; EVEX-encoded instruction, see Exceptions Type E1.

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.


MOVDQU,VMOVDQU8/16/32/64—Move Unaligned Packed Integer Values

Opcode/Instruction

Op/En

64/32 bit Mode Support

CPUID Feature Flag

Description

F3 0F 6F /r

MOVDQU xmm1, xmm2/m128

A

V/V

SSE2

Move unaligned packed integer values from xmm2/m128 to xmm1.

F3 0F 7F /r

MOVDQU xmm2/m128, xmm1

B

V/V

SSE2

Move unaligned packed integer values from xmm1 to xmm2/m128.

VEX.128.F3.0F.WIG 6F /r

VMOVDQU xmm1, xmm2/m128

A

V/V

AVX

Move unaligned packed integer values from xmm2/m128 to xmm1.

VEX.128.F3.0F.WIG 7F /r

VMOVDQU xmm2/m128, xmm1

B

V/V

AVX

Move unaligned packed integer values from xmm1 to xmm2/m128.

VEX.256.F3.0F.WIG 6F /r

VMOVDQU ymm1, ymm2/m256

A

V/V

AVX

Move unaligned packed integer values from ymm2/m256 to ymm1.

VEX.256.F3.0F.WIG 7F /r

VMOVDQU ymm2/m256, ymm1

B

V/V

AVX

Move unaligned packed integer values from ymm1 to ymm2/m256.

EVEX.128.F2.0F.W0 6F /r

VMOVDQU8 xmm1 {k1}{z}, xmm2/m128

C

V/V

AVX512VL AVX512BW

Move unaligned packed byte integer values from xmm2/m128 to xmm1 using writemask k1.

EVEX.256.F2.0F.W0 6F /r

VMOVDQU8 ymm1 {k1}{z}, ymm2/m256

C

V/V

AVX512VL AVX512BW

Move unaligned packed byte integer values from ymm2/m256 to ymm1 using writemask k1.

EVEX.512.F2.0F.W0 6F /r

VMOVDQU8 zmm1 {k1}{z}, zmm2/m512

C

V/V

AVX512BW

Move unaligned packed byte integer values from zmm2/m512 to zmm1 using writemask k1.

EVEX.128.F2.0F.W0 7F /r

VMOVDQU8 xmm2/m128 {k1}{z}, xmm1

D

V/V

AVX512VL AVX512BW

Move unaligned packed byte integer values from xmm1 to xmm2/m128 using writemask k1.

EVEX.256.F2.0F.W0 7F /r

VMOVDQU8 ymm2/m256 {k1}{z}, ymm1

D

V/V

AVX512VL AVX512BW

Move unaligned packed byte integer values from ymm1 to ymm2/m256 using writemask k1.

EVEX.512.F2.0F.W0 7F /r

VMOVDQU8 zmm2/m512 {k1}{z}, zmm1

D

V/V

AVX512BW

Move unaligned packed byte integer values from zmm1 to zmm2/m512 using writemask k1.

EVEX.128.F2.0F.W1 6F /r

VMOVDQU16 xmm1 {k1}{z}, xmm2/m128

C

V/V

AVX512VL AVX512BW

Move unaligned packed word integer values from xmm2/m128 to xmm1 using writemask k1.

EVEX.256.F2.0F.W1 6F /r

VMOVDQU16 ymm1 {k1}{z}, ymm2/m256

C

V/V

AVX512VL AVX512BW

Move unaligned packed word integer values from ymm2/m256 to ymm1 using writemask k1.

EVEX.512.F2.0F.W1 6F /r

VMOVDQU16 zmm1 {k1}{z}, zmm2/m512

C

V/V

AVX512BW

Move unaligned packed word integer values from zmm2/m512 to zmm1 using writemask k1.

EVEX.128.F2.0F.W1 7F /r

VMOVDQU16 xmm2/m128 {k1}{z}, xmm1

D

V/V

AVX512VL AVX512BW

Move unaligned packed word integer values from xmm1 to xmm2/m128 using writemask k1.

EVEX.256.F2.0F.W1 7F /r

VMOVDQU16 ymm2/m256 {k1}{z}, ymm1

D

V/V

AVX512VL AVX512BW

Move unaligned packed word integer values from ymm1 to ymm2/m256 using writemask k1.

EVEX.512.F2.0F.W1 7F /r

VMOVDQU16 zmm2/m512 {k1}{z}, zmm1

D

V/V

AVX512BW

Move unaligned packed word integer values from zmm1 to zmm2/m512 using writemask k1.

EVEX.128.F3.0F.W0 6F /r

VMOVDQU32 xmm1 {k1}{z}, xmm2/m128

C

V/V

AVX512VL AVX512F

Move unaligned packed doubleword integer values from xmm2/m128 to xmm1 using writemask k1.



EVEX.256.F3.0F.W0 6F /r

VMOVDQU32 ymm1 {k1}{z}, ymm2/m256

C

V/V

AVX512VL AVX512F

Move unaligned packed doubleword integer values from ymm2/m256 to ymm1 using writemask k1.

EVEX.512.F3.0F.W0 6F /r

VMOVDQU32 zmm1 {k1}{z}, zmm2/m512

C

V/V

AVX512F

Move unaligned packed doubleword integer values from zmm2/m512 to zmm1 using writemask k1.

EVEX.128.F3.0F.W0 7F /r

VMOVDQU32 xmm2/m128 {k1}{z}, xmm1

D

V/V

AVX512VL AVX512F

Move unaligned packed doubleword integer values from xmm1 to xmm2/m128 using writemask k1.

EVEX.256.F3.0F.W0 7F /r

VMOVDQU32 ymm2/m256 {k1}{z}, ymm1

D

V/V

AVX512VL AVX512F

Move unaligned packed doubleword integer values from ymm1 to ymm2/m256 using writemask k1.

EVEX.512.F3.0F.W0 7F /r

VMOVDQU32 zmm2/m512 {k1}{z}, zmm1

D

V/V

AVX512F

Move unaligned packed doubleword integer values from zmm1 to zmm2/m512 using writemask k1.

EVEX.128.F3.0F.W1 6F /r

VMOVDQU64 xmm1 {k1}{z}, xmm2/m128

C

V/V

AVX512VL AVX512F

Move unaligned packed quadword integer values from xmm2/m128 to xmm1 using writemask k1.

EVEX.256.F3.0F.W1 6F /r

VMOVDQU64 ymm1 {k1}{z}, ymm2/m256

C

V/V

AVX512VL AVX512F

Move unaligned packed quadword integer values from ymm2/m256 to ymm1 using writemask k1.

EVEX.512.F3.0F.W1 6F /r

VMOVDQU64 zmm1 {k1}{z}, zmm2/m512

C

V/V

AVX512F

Move unaligned packed quadword integer values from zmm2/m512 to zmm1 using writemask k1.

EVEX.128.F3.0F.W1 7F /r

VMOVDQU64 xmm2/m128 {k1}{z}, xmm1

D

V/V

AVX512VL AVX512F

Move unaligned packed quadword integer values from xmm1 to xmm2/m128 using writemask k1.

EVEX.256.F3.0F.W1 7F /r

VMOVDQU64 ymm2/m256 {k1}{z}, ymm1

D

V/V

AVX512VL AVX512F

Move unaligned packed quadword integer values from ymm1 to ymm2/m256 using writemask k1.

EVEX.512.F3.0F.W1 7F /r

VMOVDQU64 zmm2/m512 {k1}{z}, zmm1

D

V/V

AVX512F

Move unaligned packed quadword integer values from zmm1 to zmm2/m512 using writemask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

B

NA

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

C

Full Mem

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

D

Full Mem

ModRM:r/m (w)

ModRM:reg (r)

NA

NA


Description

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise, the instruction will #UD.

EVEX encoded versions:

Moves 128, 256 or 512 bits of packed byte/word/doubleword/quadword integer values from the source operand (the second operand) to the destination operand (first operand). This instruction can be used to load a vector register from a memory location, to store the contents of a vector register into a memory location, or to move data between two vector registers.



The destination operand is updated at 8-bit (VMOVDQU8), 16-bit (VMOVDQU16), 32-bit (VMOVDQU32), or 64-bit (VMOVDQU64) granularity according to the writemask.

VEX.256 encoded version:

Moves 256 bits of packed integer values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers.

Bits (MAXVL-1:256) of the destination register are zeroed.


128-bit versions:

Moves 128 bits of packed integer values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.

When the source or destination operand is a memory operand, the operand may be unaligned to any alignment without causing a general-protection exception (#GP) to be generated.

VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed.


Operation

VMOVDQU8 (EVEX encoded versions, register-copy form)

(KL, VL) = (16, 128), (32, 256), (64, 512)

FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN DEST[i+7:i] ← SRC[i+7:i]
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE DEST[i+7:i] ← 0    ; zeroing-masking
            FI
    FI;
ENDFOR

DEST[MAXVL-1:VL] ← 0


VMOVDQU8 (EVEX encoded versions, store-form)

(KL, VL) = (16, 128), (32, 256), (64, 512)

FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN DEST[i+7:i] ← SRC[i+7:i]
        ELSE *DEST[i+7:i] remains unchanged*    ; merging-masking
    FI;
ENDFOR;



VMOVDQU8 (EVEX encoded versions, load-form)

(KL, VL) = (16, 128), (32, 256), (64, 512)

FOR j ← 0 TO KL-1
    i ← j * 8
    IF k1[j] OR *no writemask*
        THEN DEST[i+7:i] ← SRC[i+7:i]
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+7:i] remains unchanged*
                ELSE DEST[i+7:i] ← 0    ; zeroing-masking
            FI
    FI;
ENDFOR

DEST[MAXVL-1:VL] ← 0


VMOVDQU16 (EVEX encoded versions, register-copy form)

(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← SRC[i+15:i]
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE DEST[i+15:i] ← 0    ; zeroing-masking
            FI
    FI;
ENDFOR

DEST[MAXVL-1:VL] ← 0


VMOVDQU16 (EVEX encoded versions, store-form)

(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← SRC[i+15:i]
        ELSE *DEST[i+15:i] remains unchanged*    ; merging-masking
    FI;
ENDFOR;



VMOVDQU16 (EVEX encoded versions, load-form)

(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j ← 0 TO KL-1
    i ← j * 16
    IF k1[j] OR *no writemask*
        THEN DEST[i+15:i] ← SRC[i+15:i]
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+15:i] remains unchanged*
                ELSE DEST[i+15:i] ← 0    ; zeroing-masking
            FI
    FI;
ENDFOR

DEST[MAXVL-1:VL] ← 0


VMOVDQU32 (EVEX encoded versions, register-copy form)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0    ; zeroing-masking
            FI
    FI;
ENDFOR

DEST[MAXVL-1:VL] ← 0


VMOVDQU32 (EVEX encoded versions, store-form)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE *DEST[i+31:i] remains unchanged*    ; merging-masking
    FI;
ENDFOR;



VMOVDQU32 (EVEX encoded versions, load-form)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0    ; zeroing-masking
            FI
    FI;
ENDFOR

DEST[MAXVL-1:VL] ← 0


VMOVDQU64 (EVEX encoded versions, register-copy form)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0    ; zeroing-masking
            FI
    FI;
ENDFOR

DEST[MAXVL-1:VL] ← 0


VMOVDQU64 (EVEX encoded versions, store-form)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE *DEST[i+63:i] remains unchanged*    ; merging-masking
    FI;
ENDFOR;



VMOVDQU64 (EVEX encoded versions, load-form)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0    ; zeroing-masking
            FI
    FI;
ENDFOR

DEST[MAXVL-1:VL] ← 0


VMOVDQU (VEX.256 encoded version, load- and register-copy form)

DEST[255:0] ← SRC[255:0]
DEST[MAXVL-1:256] ← 0


VMOVDQU (VEX.256 encoded version, store-form)

DEST[255:0] ← SRC[255:0]


VMOVDQU (VEX.128 encoded version)

DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] ← 0


VMOVDQU (128-bit load- and register-copy form, Legacy SSE version)

DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] (Unmodified)


(V)MOVDQU (128-bit store-form version)

DEST[127:0] ← SRC[127:0]


Intel C/C++ Compiler Intrinsic Equivalent

VMOVDQU16 __m512i _mm512_mask_loadu_epi16(__m512i s, __mmask32 k, void * sa);
VMOVDQU16 __m512i _mm512_maskz_loadu_epi16(__mmask32 k, void * sa);
VMOVDQU16 void _mm512_mask_storeu_epi16(void * d, __mmask32 k, __m512i a);
VMOVDQU16 __m256i _mm256_mask_loadu_epi16(__m256i s, __mmask16 k, void * sa);
VMOVDQU16 __m256i _mm256_maskz_loadu_epi16(__mmask16 k, void * sa);
VMOVDQU16 void _mm256_mask_storeu_epi16(void * d, __mmask16 k, __m256i a);
VMOVDQU16 __m128i _mm_mask_loadu_epi16(__m128i s, __mmask8 k, void * sa);
VMOVDQU16 __m128i _mm_maskz_loadu_epi16(__mmask8 k, void * sa);
VMOVDQU16 void _mm_mask_storeu_epi16(void * d, __mmask8 k, __m128i a);
VMOVDQU32 __m512i _mm512_loadu_epi32( void * sa);
VMOVDQU32 __m512i _mm512_mask_loadu_epi32(__m512i s, __mmask16 k, void * sa);
VMOVDQU32 __m512i _mm512_maskz_loadu_epi32(__mmask16 k, void * sa);
VMOVDQU32 void _mm512_storeu_epi32(void * d, __m512i a);
VMOVDQU32 void _mm512_mask_storeu_epi32(void * d, __mmask16 k, __m512i a);
VMOVDQU32 __m256i _mm256_mask_loadu_epi32(__m256i s, __mmask8 k, void * sa);
VMOVDQU32 __m256i _mm256_maskz_loadu_epi32(__mmask8 k, void * sa);
VMOVDQU32 void _mm256_storeu_epi32(void * d, __m256i a);
VMOVDQU32 void _mm256_mask_storeu_epi32(void * d, __mmask8 k, __m256i a);
VMOVDQU32 __m128i _mm_mask_loadu_epi32(__m128i s, __mmask8 k, void * sa);
VMOVDQU32 __m128i _mm_maskz_loadu_epi32(__mmask8 k, void * sa);
VMOVDQU32 void _mm_storeu_epi32(void * d, __m128i a);
VMOVDQU32 void _mm_mask_storeu_epi32(void * d, __mmask8 k, __m128i a);
VMOVDQU64 __m512i _mm512_loadu_epi64( void * sa);
VMOVDQU64 __m512i _mm512_mask_loadu_epi64(__m512i s, __mmask8 k, void * sa);
VMOVDQU64 __m512i _mm512_maskz_loadu_epi64(__mmask8 k, void * sa);
VMOVDQU64 void _mm512_storeu_epi64(void * d, __m512i a);
VMOVDQU64 void _mm512_mask_storeu_epi64(void * d, __mmask8 k, __m512i a);
VMOVDQU64 __m256i _mm256_mask_loadu_epi64(__m256i s, __mmask8 k, void * sa);
VMOVDQU64 __m256i _mm256_maskz_loadu_epi64(__mmask8 k, void * sa);
VMOVDQU64 void _mm256_storeu_epi64(void * d, __m256i a);
VMOVDQU64 void _mm256_mask_storeu_epi64(void * d, __mmask8 k, __m256i a);
VMOVDQU64 __m128i _mm_mask_loadu_epi64(__m128i s, __mmask8 k, void * sa);
VMOVDQU64 __m128i _mm_maskz_loadu_epi64(__mmask8 k, void * sa);
VMOVDQU64 void _mm_storeu_epi64(void * d, __m128i a);
VMOVDQU64 void _mm_mask_storeu_epi64(void * d, __mmask8 k, __m128i a);
VMOVDQU8 __m512i _mm512_mask_loadu_epi8(__m512i s, __mmask64 k, void * sa);
VMOVDQU8 __m512i _mm512_maskz_loadu_epi8(__mmask64 k, void * sa);
VMOVDQU8 void _mm512_mask_storeu_epi8(void * d, __mmask64 k, __m512i a);
VMOVDQU8 __m256i _mm256_mask_loadu_epi8(__m256i s, __mmask32 k, void * sa);
VMOVDQU8 __m256i _mm256_maskz_loadu_epi8(__mmask32 k, void * sa);
VMOVDQU8 void _mm256_mask_storeu_epi8(void * d, __mmask32 k, __m256i a);
VMOVDQU8 __m128i _mm_mask_loadu_epi8(__m128i s, __mmask16 k, void * sa);
VMOVDQU8 __m128i _mm_maskz_loadu_epi8(__mmask16 k, void * sa);
VMOVDQU8 void _mm_mask_storeu_epi8(void * d, __mmask16 k, __m128i a);
MOVDQU __m256i _mm256_loadu_si256 (__m256i * p);
MOVDQU void _mm256_storeu_si256(__m256i *p, __m256i a);
MOVDQU __m128i _mm_loadu_si128 (__m128i * p);
MOVDQU void _mm_storeu_si128(__m128i *p, __m128i a);


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4; EVEX-encoded instruction, see Exceptions Type E4.nb.

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.



MOVDQ2Q—Move Quadword from XMM to MMX Technology Register

Opcode

Instruction

Op/En

64-Bit Mode

Compat/Leg Mode

Description


F2 0F D6 /r MOVDQ2Q mm, xmm RM Valid Valid Move low quadword from xmm to MMX technology register.


Instruction Operand Encoding

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

RM

ModRM:reg (w)

ModRM:r/m (r)

NA

NA


Description

Moves the low quadword from the source operand (second operand) to the destination operand (first operand). The source operand is an XMM register and the destination operand is an MMX technology register.

This instruction causes a transition from x87 FPU to MMX technology operation (that is, the x87 FPU top-of-stack pointer is set to 0 and the x87 FPU tag word is set to all 0s [valid]). If this instruction is executed while an x87 FPU floating-point exception is pending, the exception is handled before the MOVDQ2Q instruction is executed.

In 64-bit mode, use of the REX.R prefix permits this instruction to access additional registers (XMM8-XMM15).


Operation

DEST ← SRC[63:0];


Intel C/C Compiler Intrinsic Equivalent

MOVDQ2Q: __m64 _mm_movepi64_pi64 (__m128i a)


SIMD Floating-Point Exceptions

None.


Protected Mode Exceptions

#NM If CR0.TS[bit 3] = 1.

#UD If CR0.EM[bit 2] = 1.

If CR4.OSFXSR[bit 9] = 0.

If CPUID.01H:EDX.SSE2[bit 26] = 0.

If the LOCK prefix is used.

#MF If there is a pending x87 FPU exception.


Real-Address Mode Exceptions

Same exceptions as in protected mode.


Virtual-8086 Mode Exceptions

Same exceptions as in protected mode.


Compatibility Mode Exceptions

Same exceptions as in protected mode.


64-Bit Mode Exceptions

Same exceptions as in protected mode.


MOVHLPS—Move Packed Single-Precision Floating-Point Values High to Low

Opcode/Instruction

Op/En

64/32 bit Mode Support

CPUID Feature Flag

Description

NP 0F 12 /r

MOVHLPS xmm1, xmm2

RM

V/V

SSE

Move two packed single-precision floating-point values from high quadword of xmm2 to low quadword of xmm1.

VEX.NDS.128.0F.WIG 12 /r

VMOVHLPS xmm1, xmm2, xmm3

RVM

V/V

AVX

Merge two packed single-precision floating-point values from high quadword of xmm3 and low quadword of xmm2.

EVEX.NDS.128.0F.W0 12 /r

VMOVHLPS xmm1, xmm2, xmm3

RVM

V/V

AVX512F

Merge two packed single-precision floating-point values from high quadword of xmm3 and low quadword of xmm2.


Instruction Operand Encoding1

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

RM

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

RVM

ModRM:reg (w)

vvvv (r)

ModRM:r/m (r)

NA

Description

This instruction cannot be used for memory to register moves.

128-bit two-argument form:

Moves two packed single-precision floating-point values from the high quadword of the second XMM argument (second operand) to the low quadword of the first XMM register (first argument). The quadword at bits 127:64 of the destination operand is left unchanged. Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.

128-bit and EVEX three-argument form

Moves two packed single-precision floating-point values from the high quadword of the third XMM argument (third operand) to the low quadword of the destination (first operand). Copies the high quadword from the second XMM argument (second operand) to the high quadword of the destination (first operand). Bits (MAXVL-1:128) of the corresponding destination register are zeroed.

Attempting to execute VMOVHLPS encoded with VEX.L = 1 or EVEX.L’L = 1 will cause an #UD exception.


Operation

MOVHLPS (128-bit two-argument form)

DEST[63:0] ← SRC[127:64]

DEST[MAXVL-1:64] (Unmodified)


VMOVHLPS (128-bit three-argument form - VEX & EVEX)

DEST[63:0] ← SRC2[127:64]
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

MOVHLPS __m128 _mm_movehl_ps(__m128 a, __m128 b)


SIMD Floating-Point Exceptions

None



1. ModRM.MOD = 011B required



Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 7; additionally

#UD If VEX.L = 1.

EVEX-encoded instruction, see Exceptions Type E7NM.128.


MOVHPD—Move High Packed Double-Precision Floating-Point Value

Opcode/Instruction

Op/En

64/32 bit Mode Support

CPUID Feature Flag

Description

66 0F 16 /r

MOVHPD xmm1, m64

A

V/V

SSE2

Move double-precision floating-point value from m64 to high quadword of xmm1.

VEX.NDS.128.66.0F.WIG 16 /r

VMOVHPD xmm2, xmm1, m64

B

V/V

AVX

Merge double-precision floating-point value from m64 and the low quadword of xmm1.

EVEX.NDS.128.66.0F.W1 16 /r

VMOVHPD xmm2, xmm1, m64

D

V/V

AVX512F

Merge double-precision floating-point value from m64 and the low quadword of xmm1.

66 0F 17 /r

MOVHPD m64, xmm1

C

V/V

SSE2

Move double-precision floating-point value from high quadword of xmm1 to m64.

VEX.128.66.0F.WIG 17 /r VMOVHPD m64, xmm1

C

V/V

AVX

Move double-precision floating-point value from high quadword of xmm1 to m64.

EVEX.128.66.0F.W1 17 /r VMOVHPD m64, xmm1

E

V/V

AVX512F

Move double-precision floating-point value from high quadword of xmm1 to m64.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (r, w)

ModRM:r/m (r)

NA

NA

B

NA

ModRM:reg (w)

VEX.vvvv

ModRM:r/m (r)

NA

C

NA

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

D

Tuple1 Scalar

ModRM:reg (w)

EVEX.vvvv

ModRM:r/m (r)

NA

E

Tuple1 Scalar

ModRM:r/m (w)

ModRM:reg (r)

NA

NA


Description

This instruction cannot be used for register to register or memory to memory moves.

128-bit Legacy SSE load:

Moves a double-precision floating-point value from the source 64-bit memory operand and stores it in the high 64 bits of the destination XMM register. The lower 64 bits of the XMM register are preserved. Bits (MAXVL-1:128) of the corresponding destination register are preserved.

VEX.128 & EVEX encoded load:

Loads a double-precision floating-point value from the source 64-bit memory operand (the third operand) and stores it in the upper 64 bits of the destination XMM register (first operand). The low 64 bits from the first source operand (second operand) are copied to the low 64 bits of the destination. Bits (MAXVL-1:128) of the corresponding destination register are zeroed.

128-bit store:

Stores a double-precision floating-point value from the high 64-bits of the XMM register source (second operand) to the 64-bit memory location (first operand).

Note: VMOVHPD (store) (VEX.128.66.0F 17 /r) is legal and has the same behavior as the existing 66 0F 17 store. For VMOVHPD (store), VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise, the instruction will #UD.

Attempting to execute VMOVHPD encoded with VEX.L = 1 or EVEX.L’L = 1 will cause an #UD exception.



Operation

MOVHPD (128-bit Legacy SSE load)

DEST[63:0] (Unmodified)
DEST[127:64] ← SRC[63:0]
DEST[MAXVL-1:128] (Unmodified)


VMOVHPD (VEX.128 & EVEX encoded load)

DEST[63:0] ← SRC1[63:0]
DEST[127:64] ← SRC2[63:0]
DEST[MAXVL-1:128] ← 0


VMOVHPD (store)

DEST[63:0] ← SRC[127:64]


Intel C/C++ Compiler Intrinsic Equivalent

MOVHPD __m128d _mm_loadh_pd (__m128d a, double *p)
MOVHPD void _mm_storeh_pd (double *p, __m128d a)


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5; additionally

#UD If VEX.L = 1.

EVEX-encoded instruction, see Exceptions Type E9NF.


MOVHPS—Move High Packed Single-Precision Floating-Point Values

Opcode/Instruction

Op/En

64/32 bit Mode Support

CPUID Feature Flag

Description

NP 0F 16 /r

MOVHPS xmm1, m64

A

V/V

SSE

Move two packed single-precision floating-point values from m64 to high quadword of xmm1.

VEX.NDS.128.0F.WIG 16 /r

VMOVHPS xmm2, xmm1, m64

B

V/V

AVX

Merge two packed single-precision floating-point values from m64 and the low quadword of xmm1.

EVEX.NDS.128.0F.W0 16 /r

VMOVHPS xmm2, xmm1, m64

D

V/V

AVX512F

Merge two packed single-precision floating-point values from m64 and the low quadword of xmm1.

NP 0F 17 /r

MOVHPS m64, xmm1

C

V/V

SSE

Move two packed single-precision floating-point values from high quadword of xmm1 to m64.

VEX.128.0F.WIG 17 /r VMOVHPS m64, xmm1

C

V/V

AVX

Move two packed single-precision floating-point values from high quadword of xmm1 to m64.

EVEX.128.0F.W0 17 /r VMOVHPS m64, xmm1

E

V/V

AVX512F

Move two packed single-precision floating-point values from high quadword of xmm1 to m64.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (r, w)

ModRM:r/m (r)

NA

NA

B

NA

ModRM:reg (w)

VEX.vvvv

ModRM:r/m (r)

NA

C

NA

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

D

Tuple2

ModRM:reg (w)

EVEX.vvvv

ModRM:r/m (r)

NA

E

Tuple2

ModRM:r/m (w)

ModRM:reg (r)

NA

NA


Description

This instruction cannot be used for register to register or memory to memory moves.

128-bit Legacy SSE load:

Moves two packed single-precision floating-point values from the source 64-bit memory operand and stores them in the high 64 bits of the destination XMM register. The lower 64 bits of the XMM register are preserved. Bits (MAXVL-1:128) of the corresponding destination register are preserved.

VEX.128 & EVEX encoded load:

Loads two single-precision floating-point values from the source 64-bit memory operand (the third operand) and stores them in the upper 64 bits of the destination XMM register (first operand). The low 64 bits from the first source operand (the second operand) are copied to the lower 64 bits of the destination. Bits (MAXVL-1:128) of the corresponding destination register are zeroed.

128-bit store:

Stores two packed single-precision floating-point values from the high 64-bits of the XMM register source (second operand) to the 64-bit memory location (first operand).

Note: VMOVHPS (store) (VEX.128.0F 17 /r) is legal and has the same behavior as the existing 0F 17 store. For VMOVHPS (store), VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise, the instruction will #UD.

Attempting to execute VMOVHPS encoded with VEX.L = 1 or EVEX.L’L = 1 will cause an #UD exception.



Operation

MOVHPS (128-bit Legacy SSE load)

DEST[63:0] (Unmodified)
DEST[127:64] ← SRC[63:0]

DEST[MAXVL-1:128] (Unmodified)


VMOVHPS (VEX.128 and EVEX encoded load)

DEST[63:0] ← SRC1[63:0]
DEST[127:64] ← SRC2[63:0]
DEST[MAXVL-1:128] ← 0


VMOVHPS (store)

DEST[63:0] ← SRC[127:64]


Intel C/C++ Compiler Intrinsic Equivalent

MOVHPS __m128 _mm_loadh_pi (__m128 a, __m64 *p)
MOVHPS void _mm_storeh_pi (__m64 *p, __m128 a)


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5; additionally

#UD If VEX.L = 1.

EVEX-encoded instruction, see Exceptions Type E9NF.


MOVLHPS—Move Packed Single-Precision Floating-Point Values Low to High

Opcode/Instruction

Op/En

64/32 bit Mode Support

CPUID Feature Flag

Description

NP 0F 16 /r

MOVLHPS xmm1, xmm2

RM

V/V

SSE

Move two packed single-precision floating-point values from low quadword of xmm2 to high quadword of xmm1.

VEX.NDS.128.0F.WIG 16 /r

VMOVLHPS xmm1, xmm2, xmm3

RVM

V/V

AVX

Merge two packed single-precision floating-point values from low quadword of xmm3 and low quadword of xmm2.

EVEX.NDS.128.0F.W0 16 /r

VMOVLHPS xmm1, xmm2, xmm3

RVM

V/V

AVX512F

Merge two packed single-precision floating-point values from low quadword of xmm3 and low quadword of xmm2.


Instruction Operand Encoding1

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

RM

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

RVM

ModRM:reg (w)

vvvv (r)

ModRM:r/m (r)

NA

Description

This instruction cannot be used for memory to register moves.

128-bit two-argument form:

Moves two packed single-precision floating-point values from the low quadword of the second XMM argument (second operand) to the high quadword of the first XMM register (first argument). The low quadword of the destination operand is left unchanged. Bits (MAXVL-1:128) of the corresponding destination register are unmodified.

128-bit three-argument forms:

Moves two packed single-precision floating-point values from the low quadword of the third XMM argument (third operand) to the high quadword of the destination (first operand). Copies the low quadword from the second XMM argument (second operand) to the low quadword of the destination (first operand). Bits (MAXVL-1:128) of the corresponding destination register are zeroed.

Attempting to execute VMOVLHPS encoded with VEX.L = 1 or EVEX.L’L = 1 will cause an #UD exception.


Operation

MOVLHPS (128-bit two-argument form)

DEST[63:0] (Unmodified)
DEST[127:64] ← SRC[63:0]

DEST[MAXVL-1:128] (Unmodified)


VMOVLHPS (128-bit three-argument form - VEX & EVEX)

DEST[63:0] ← SRC1[63:0]
DEST[127:64] ← SRC2[63:0]
DEST[MAXVL-1:128] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

MOVLHPS __m128 _mm_movelh_ps(__m128 a, __m128 b)


SIMD Floating-Point Exceptions

None



1. ModRM.MOD = 011B required



Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 7; additionally

#UD If VEX.L = 1.

EVEX-encoded instruction, see Exceptions Type E7NM.128.


MOVLPD—Move Low Packed Double-Precision Floating-Point Value

Opcode/Instruction

Op/En

64/32 bit Mode Support

CPUID Feature Flag

Description

66 0F 12 /r

MOVLPD xmm1, m64

A

V/V

SSE2

Move double-precision floating-point value from m64 to low quadword of xmm1.

VEX.NDS.128.66.0F.WIG 12 /r

VMOVLPD xmm2, xmm1, m64

B

V/V

AVX

Merge double-precision floating-point value from m64 and the high quadword of xmm1.

EVEX.NDS.128.66.0F.W1 12 /r

VMOVLPD xmm2, xmm1, m64

D

V/V

AVX512F

Merge double-precision floating-point value from m64 and the high quadword of xmm1.

66 0F 13/r

MOVLPD m64, xmm1

C

V/V

SSE2

Move double-precision floating-point value from low quadword of xmm1 to m64.

VEX.128.66.0F.WIG 13/r VMOVLPD m64, xmm1

C

V/V

AVX

Move double-precision floating-point value from low quadword of xmm1 to m64.

EVEX.128.66.0F.W1 13/r VMOVLPD m64, xmm1

E

V/V

AVX512F

Move double-precision floating-point value from low quadword of xmm1 to m64.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (r, w)

ModRM:r/m (r)

NA

NA

B

NA

ModRM:reg (w)

VEX.vvvv

ModRM:r/m (r)

NA

C

NA

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

D

Tuple1 Scalar

ModRM:reg (w)

EVEX.vvvv

ModRM:r/m (r)

NA

E

Tuple1 Scalar

ModRM:r/m (w)

ModRM:reg (r)

NA

NA


Description

This instruction cannot be used for register to register or memory to memory moves.

128-bit Legacy SSE load:

Moves a double-precision floating-point value from the source 64-bit memory operand and stores it in the low 64 bits of the destination XMM register. The upper 64 bits of the XMM register are preserved. Bits (MAXVL-1:128) of the corresponding destination register are preserved.

VEX.128 & EVEX encoded load:

Loads a double-precision floating-point value from the source 64-bit memory operand (third operand), merges it with the upper 64-bits of the first source XMM register (second operand), and stores it in the low 128-bits of the destination XMM register (first operand). Bits (MAXVL-1:128) of the corresponding destination register are zeroed.

128-bit store:

Stores a double-precision floating-point value from the low 64-bits of the XMM register source (second operand) to the 64-bit memory location (first operand).

Note: VMOVLPD (store) (VEX.128.66.0F 13 /r) is legal and has the same behavior as the existing 66 0F 13 store. For VMOVLPD (store), VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise, the instruction will #UD.

Attempting to execute VMOVLPD encoded with VEX.L = 1 or EVEX.L’L = 1 will cause an #UD exception.


Operation

MOVLPD (128-bit Legacy SSE load)

DEST[63:0] ← SRC[63:0]

DEST[MAXVL-1:64] (Unmodified)



VMOVLPD (VEX.128 & EVEX encoded load)

DEST[63:0] ← SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0


VMOVLPD (store)

DEST[63:0] ← SRC[63:0]


Intel C/C++ Compiler Intrinsic Equivalent

MOVLPD __m128d _mm_loadl_pd (__m128d a, double *p)
MOVLPD void _mm_storel_pd (double *p, __m128d a)


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5; additionally

#UD If VEX.L = 1.

EVEX-encoded instruction, see Exceptions Type E9NF.


MOVLPS—Move Low Packed Single-Precision Floating-Point Values

Opcode/Instruction

Op/En

64/32 bit Mode Support

CPUID Feature Flag

Description

NP 0F 12 /r

MOVLPS xmm1, m64

A

V/V

SSE

Move two packed single-precision floating-point values from m64 to low quadword of xmm1.

VEX.NDS.128.0F.WIG 12 /r

VMOVLPS xmm2, xmm1, m64

B

V/V

AVX

Merge two packed single-precision floating-point values from m64 and the high quadword of xmm1.

EVEX.NDS.128.0F.W0 12 /r

VMOVLPS xmm2, xmm1, m64

D

V/V

AVX512F

Merge two packed single-precision floating-point values from m64 and the high quadword of xmm1.

0F 13/r

MOVLPS m64, xmm1

C

V/V

SSE

Move two packed single-precision floating-point values from low quadword of xmm1 to m64.

VEX.128.0F.WIG 13/r VMOVLPS m64, xmm1

C

V/V

AVX

Move two packed single-precision floating-point values from low quadword of xmm1 to m64.

EVEX.128.0F.W0 13/r VMOVLPS m64, xmm1

E

V/V

AVX512F

Move two packed single-precision floating-point values from low quadword of xmm1 to m64.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (r, w)

ModRM:r/m (r)

NA

NA

B

NA

ModRM:reg (w)

VEX.vvvv

ModRM:r/m (r)

NA

C

NA

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

D

Tuple2

ModRM:reg (w)

EVEX.vvvv

ModRM:r/m (r)

NA

E

Tuple2

ModRM:r/m (w)

ModRM:reg (r)

NA

NA


Description

This instruction cannot be used for register to register or memory to memory moves.

128-bit Legacy SSE load:

Moves two packed single-precision floating-point values from the source 64-bit memory operand and stores them in the low 64-bits of the destination XMM register. The upper 64 bits of the XMM register are preserved. Bits (MAXVL-1:128) of the corresponding destination register are preserved.

VEX.128 & EVEX encoded load:

Loads two packed single-precision floating-point values from the source 64-bit memory operand (the third operand), merges them with the upper 64-bits of the first source operand (the second operand), and stores them in the low 128-bits of the destination register (the first operand). Bits (MAXVL-1:128) of the corresponding destination register are zeroed.

128-bit store:

Stores two packed single-precision floating-point values from the low 64-bits of the XMM register source (second operand) to the 64-bit memory location (first operand).

Note: VMOVLPS (store) (VEX.128.0F 13 /r) is legal and has the same behavior as the existing 0F 13 store. For VMOVLPS (store), VEX.vvvv and EVEX.vvvv are reserved and must be 1111b; otherwise the instruction will #UD.


If VMOVLPS is encoded with VEX.L = 1 or EVEX.L’L = 1, an attempt to execute the instruction will cause an #UD exception.



Operation

MOVLPS (128-bit Legacy SSE load)

DEST[63:0] ← SRC[63:0]

DEST[MAXVL-1:64] (Unmodified)


VMOVLPS (VEX.128 & EVEX encoded load)

DEST[63:0] ← SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0


VMOVLPS (store)

DEST[63:0] ← SRC[63:0]


Intel C/C++ Compiler Intrinsic Equivalent

MOVLPS __m128 _mm_loadl_pi ( __m128 a, __m64 *p)
MOVLPS void _mm_storel_pi ( __m64 *p, __m128 a)
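As a usage sketch under the intrinsics above (the helper name and casts are this sketch's own, not the manual's), two floats can be merged into the low quadword of a vector and stored back out; compilers generally map these intrinsics to MOVLPS-class moves:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Merge two floats from memory into the low quadword of v
   (MOVLPS load: high quadword of v preserved), then store the
   low quadword back to memory (MOVLPS store). */
static void roundtrip_low_ps(__m128 v, const float in[2], float out[2])
{
    __m128 merged = _mm_loadl_pi(v, (const __m64 *)in); /* low two floats = in */
    _mm_storel_pi((__m64 *)out, merged);
}
```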


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5; additionally

#UD If VEX.L = 1.

EVEX-encoded instruction, see Exceptions Type E9NF.


MOVMSKPD—Extract Packed Double-Precision Floating-Point Sign Mask

Opcode/ Instruction

Op/ En

64/32-bit Mode

CPUID

Feature Flag

Description

66 0F 50 /r

MOVMSKPD reg, xmm

RM

V/V

SSE2

Extract 2-bit sign mask from xmm and store in reg. The upper bits of r32 or r64 are filled with zeros.

VEX.128.66.0F.WIG 50 /r

VMOVMSKPD reg, xmm2

RM

V/V

AVX

Extract 2-bit sign mask from xmm2 and store in reg. The upper bits of r32 or r64 are zeroed.

VEX.256.66.0F.WIG 50 /r

VMOVMSKPD reg, ymm2

RM

V/V

AVX

Extract 4-bit sign mask from ymm2 and store in reg. The upper bits of r32 or r64 are zeroed.


Instruction Operand Encoding

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

RM

ModRM:reg (w)

ModRM:r/m (r)

NA

NA


Description

Extracts the sign bits from the packed double-precision floating-point values in the source operand (second operand), formats them into a 2-bit mask, and stores the mask in the destination operand (first operand). The source operand is an XMM register, and the destination operand is a general-purpose register. The mask is stored in the 2 low-order bits of the destination operand; the upper bits of the destination are zero-extended.

In 64-bit mode, the instruction can access additional registers (XMM8-XMM15, R8-R15) when used with a REX.R prefix. The default operand size is 64-bit in 64-bit mode.

128-bit versions: The source operand is an XMM register. The destination operand is a general purpose register.

VEX.256 encoded version: The source operand is a YMM register. The destination operand is a general purpose register.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.


Operation

(V)MOVMSKPD (128-bit versions)

DEST[0] ← SRC[63]
DEST[1] ← SRC[127]
IF DEST = r32
THEN DEST[31:2] ← 0;
ELSE DEST[63:2] ← 0;
FI


VMOVMSKPD (VEX.256 encoded version)

DEST[0] ← SRC[63]
DEST[1] ← SRC[127]
DEST[2] ← SRC[191]
DEST[3] ← SRC[255]
IF DEST = r32
THEN DEST[31:4] ← 0;
ELSE DEST[63:4] ← 0;
FI



Intel C/C Compiler Intrinsic Equivalent MOVMSKPD: int _mm_movemask_pd ( m128d a) VMOVMSKPD: _mm256_movemask_pd( m256d a)

SIMD Floating-Point Exceptions

None


Other Exceptions

See Exceptions Type 7; additionally

#UD If VEX.vvvv ≠ 1111B.


MOVMSKPS—Extract Packed Single-Precision Floating-Point Sign Mask

Opcode/ Instruction

Op/ En

64/32-bit Mode

CPUID

Feature Flag

Description

NP 0F 50 /r

MOVMSKPS reg, xmm

RM

V/V

SSE

Extract 4-bit sign mask from xmm and store in reg. The upper bits of r32 or r64 are filled with zeros.

VEX.128.0F.WIG 50 /r

VMOVMSKPS reg, xmm2

RM

V/V

AVX

Extract 4-bit sign mask from xmm2 and store in reg. The upper bits of r32 or r64 are zeroed.

VEX.256.0F.WIG 50 /r

VMOVMSKPS reg, ymm2

RM

V/V

AVX

Extract 8-bit sign mask from ymm2 and store in reg. The upper bits of r32 or r64 are zeroed.


Instruction Operand Encoding

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

RM

ModRM:reg (w)

ModRM:r/m (r)

NA

NA


Description

Extracts the sign bits from the packed single-precision floating-point values in the source operand (second operand), formats them into a 4- or 8-bit mask, and stores the mask in the destination operand (first operand). The source operand is an XMM or YMM register, and the destination operand is a general-purpose register. The mask is stored in the 4 or 8 low-order bits of the destination operand. The upper bits of the destination operand beyond the mask are filled with zeros.

In 64-bit mode, the instruction can access additional registers (XMM8-XMM15, R8-R15) when used with a REX.R prefix. The default operand size is 64-bit in 64-bit mode.

128-bit versions: The source operand is an XMM register. The destination operand is a general purpose register.

VEX.256 encoded version: The source operand is a YMM register. The destination operand is a general purpose register.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.


Operation

DEST[0] ← SRC[31];
DEST[1] ← SRC[63];
DEST[2] ← SRC[95];
DEST[3] ← SRC[127];
IF DEST = r32
THEN DEST[31:4] ← ZeroExtend;
ELSE DEST[63:4] ← ZeroExtend;
FI;



(V)MOVMSKPS (128-bit version)

DEST[0] ← SRC[31]
DEST[1] ← SRC[63]
DEST[2] ← SRC[95]
DEST[3] ← SRC[127]
IF DEST = r32
THEN DEST[31:4] ← 0;
ELSE DEST[63:4] ← 0;
FI


VMOVMSKPS (VEX.256 encoded version)

DEST[0] ← SRC[31]
DEST[1] ← SRC[63]
DEST[2] ← SRC[95]
DEST[3] ← SRC[127]
DEST[4] ← SRC[159]
DEST[5] ← SRC[191]
DEST[6] ← SRC[223]
DEST[7] ← SRC[255]
IF DEST = r32
THEN DEST[31:8] ← 0;
ELSE DEST[63:8] ← 0;
FI


Intel C/C Compiler Intrinsic Equivalent

int _mm_movemask_ps( m128 a)

int _mm256_movemask_ps( m256 a)
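As a brief sketch (not part of the manual), the 4-bit mask behavior described above can be checked with the 128-bit intrinsic; bit i of the result is the sign bit of float element i:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* MOVMSKPS: collect the four single-precision sign bits into
   bits 3:0; the rest of the result is zero. */
static int sign_mask_ps(__m128 v)
{
    return _mm_movemask_ps(v);
}
```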


SIMD Floating-Point Exceptions

None.


Other Exceptions

See Exceptions Type 7; additionally

#UD If VEX.vvvv ≠ 1111B.


MOVNTDQA—Load Double Quadword Non-Temporal Aligned Hint

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

66 0F 38 2A /r MOVNTDQA xmm1, m128

A

V/V

SSE4_1

Move double quadword from m128 to xmm1 using non- temporal hint if WC memory type.

VEX.128.66.0F38.WIG 2A /r VMOVNTDQA xmm1, m128

A

V/V

AVX

Move double quadword from m128 to xmm using non- temporal hint if WC memory type.

VEX.256.66.0F38.WIG 2A /r VMOVNTDQA ymm1, m256

A

V/V

AVX2

Move 256-bit data from m256 to ymm using non-temporal hint if WC memory type.

EVEX.128.66.0F38.W0 2A /r VMOVNTDQA xmm1, m128

B

V/V

AVX512VL AVX512F

Move 128-bit data from m128 to xmm using non-temporal hint if WC memory type.

EVEX.256.66.0F38.W0 2A /r VMOVNTDQA ymm1, m256

B

V/V

AVX512VL AVX512F

Move 256-bit data from m256 to ymm using non-temporal hint if WC memory type.

EVEX.512.66.0F38.W0 2A /r VMOVNTDQA zmm1, m512

B

V/V

AVX512F

Move 512-bit data from m512 to zmm using non-temporal hint if WC memory type.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

B

Full Mem

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

Description

MOVNTDQA loads a double quadword from the source operand (second operand) to the destination operand (first operand) using a non-temporal hint if the memory source is WC (write combining) memory type. For WC memory type, the non-temporal hint may be implemented by loading a temporary internal buffer with the equivalent of an aligned cache line without filling this data to the cache. Any memory-type aliased lines in the cache will be snooped and flushed. Subsequent MOVNTDQA reads to unread portions of the WC cache line will receive data from the temporary internal buffer if data is available. The temporary internal buffer may be flushed by the processor at any time and for any reason.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when reading the data from memory. Using this protocol, the processor does not read the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being read can override the non-temporal hint, if the memory address specified for the non-temporal read is not a WC memory region. Information on non-temporal reads and writes can be found in “Caching of Temporal vs. Non-Temporal Data” in Chapter 10 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with a MFENCE instruction should be used in conjunction with MOVNTDQA instructions if multiple processors might use different memory types for the referenced memory locations or to synchronize reads of a processor with writes by other agents in the system. A processor’s implementation of the streaming load hint does not override the effective memory type, but the implementation of the hint is processor dependent. For example, a processor implementation may choose to ignore the hint and process the instruction as a normal MOVDQA for any memory type. Alternatively, another implementation may optimize cache reads generated by MOVNTDQA on WB memory type to reduce cache evictions.

The 128-bit (V)MOVNTDQA addresses must be 16-byte aligned or the instruction will cause a #GP. The 256-bit VMOVNTDQA addresses must be 32-byte aligned or the instruction will cause a #GP. The 512-bit VMOVNTDQA addresses must be 64-byte aligned or the instruction will cause a #GP.
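As a practical illustration of the alignment rule above (this helper is a sketch, not from the manual), a buffer intended for streaming loads can be allocated with C11 `aligned_alloc` and its alignment verified; passing a misaligned address to (V)MOVNTDQA would raise #GP:

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate a buffer whose address satisfies the 16/32/64-byte
   alignment that MOVNTDQA-class loads require. C11 aligned_alloc
   requires the size to be a multiple of the alignment, so the
   requested byte count is rounded up. */
static void *alloc_stream_buffer(size_t align, size_t bytes)
{
    return aligned_alloc(align, (bytes + align - 1) / align * align);
}
```

`aligned_alloc` is C11; on platforms without it, `posix_memalign` or `_aligned_malloc` serve the same purpose.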

Operation

MOVNTDQA (128-bit Legacy SSE form)

DEST ← SRC

DEST[MAXVL-1:128] (Unmodified)


VMOVNTDQA (VEX.128 and EVEX.128 encoded form)

DEST ← SRC
DEST[MAXVL-1:128] ← 0


VMOVNTDQA (VEX.256 and EVEX.256 encoded forms)

DEST[255:0] ← SRC[255:0]
DEST[MAXVL-1:256] ← 0


VMOVNTDQA (EVEX.512 encoded form)

DEST[511:0] ← SRC[511:0]
DEST[MAXVL-1:512] ← 0


Intel C/C++ Compiler Intrinsic Equivalent

MOVNTDQA __m128i _mm_stream_load_si128 ( __m128i const* p);
VMOVNTDQA __m256i _mm256_stream_load_si256 ( __m256i const* p);
VMOVNTDQA __m512i _mm512_stream_load_si512 ( __m512i const* p);


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 1; EVEX-encoded instruction, see Exceptions Type E1NF.

#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.


MOVNTDQ—Store Packed Integers Using Non-Temporal Hint

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

66 0F E7 /r

MOVNTDQ m128, xmm1

A

V/V

SSE2

Move packed integer values in xmm1 to m128 using non- temporal hint.

VEX.128.66.0F.WIG E7 /r VMOVNTDQ m128, xmm1

A

V/V

AVX

Move packed integer values in xmm1 to m128 using non- temporal hint.

VEX.256.66.0F.WIG E7 /r VMOVNTDQ m256, ymm1

A

V/V

AVX

Move packed integer values in ymm1 to m256 using non- temporal hint.

EVEX.128.66.0F.W0 E7 /r VMOVNTDQ m128, xmm1

B

V/V

AVX512VL AVX512F

Move packed integer values in xmm1 to m128 using non- temporal hint.

EVEX.256.66.0F.W0 E7 /r VMOVNTDQ m256, ymm1

B

V/V

AVX512VL AVX512F

Move packed integer values in ymm1 to m256 using non- temporal hint.

EVEX.512.66.0F.W0 E7 /r VMOVNTDQ m512, zmm1

B

V/V

AVX512F

Move packed integer values in zmm1 to m512 using non- temporal hint.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

B

Full Mem

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

Description

Moves the packed integers in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to prevent caching of the data during the write to memory. The source operand is an XMM register, YMM register or ZMM register, which is assumed to contain integer data (packed bytes, words, doublewords, or quadwords). The destination operand is a 128-bit, 256-bit or 512-bit memory location. The memory operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte (512-bit version) boundary otherwise a general-protection exception (#GP) will be generated.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see “Caching of Temporal vs. Non-Temporal Data” in Chapter 10 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with VMOVNTDQ instructions if multiple processors might use different memory types to read/write the destination memory locations.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, VEX.L must be 0; otherwise instructions will #UD.


Operation

VMOVNTDQ(EVEX encoded versions)

VL = 128, 256, 512
DEST[VL-1:0] ← SRC[VL-1:0]
DEST[MAXVL-1:VL] ← 0



MOVNTDQ (Legacy and VEX versions)

DEST ← SRC


Intel C/C++ Compiler Intrinsic Equivalent

MOVNTDQ void _mm_stream_si128 ( __m128i * p, __m128i a);
VMOVNTDQ void _mm256_stream_si256 ( __m256i * p, __m256i a);
VMOVNTDQ void _mm512_stream_si512 (void * p, __m512i a);
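As a usage sketch (the helper name is this sketch's own), a 128-bit non-temporal store through the intrinsic above can be paired with SFENCE, per the weak-ordering note in the description:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Stream 128-bit integer data to a 16-byte-aligned destination
   (MOVNTDQ), then fence so the WC store is ordered before any
   subsequent stores become globally visible. */
static void stream_copy128(__m128i *dst, __m128i v)
{
    _mm_stream_si128(dst, v);
    _mm_sfence();
}
```

The destination must be 16-byte aligned or the instruction raises #GP.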


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 1.SSE2; EVEX-encoded instruction, see Exceptions Type E1NF.

#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.


MOVNTI—Store Doubleword Using Non-Temporal Hint

Opcode

Instruction

Op/ En

64-Bit Mode

Compat/ Leg Mode

Description

NP 0F C3 /r

MOVNTI m32, r32

MR

Valid

Valid

Move doubleword from r32 to m32 using non- temporal hint.

NP REX.W + 0F C3 /r

MOVNTI m64, r64

MR

Valid

N.E.

Move quadword from r64 to m64 using non- temporal hint.


Instruction Operand Encoding

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

MR

ModRM:r/m (w)

ModRM:reg (r)

NA

NA


Description

Moves the doubleword integer in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to minimize cache pollution during the write to memory. The source operand is a general-purpose register. The destination operand is a 32-bit memory location.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see “Caching of Temporal vs. Non-Temporal Data” in Chapter 10 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MOVNTI instructions if multiple processors might use different memory types to read/write the destination memory locations.

In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.R prefix permits access to additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.


Operation

DEST ← SRC;


Intel C/C Compiler Intrinsic Equivalent

MOVNTI: void _mm_stream_si32 (int *p, int a)

MOVNTI: void _mm_stream_si64( __int64 *p, __int64 a)
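As a small sketch of the doubleword form above (the helper name is invented here), a non-temporal store followed by SFENCE keeps the streamed value ordered ahead of later stores, per the fencing guidance in the description:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* MOVNTI-style store of a 32-bit value, followed by SFENCE so other
   agents observe the data before any flag update that follows. */
static void publish_value(int *slot, int value)
{
    _mm_stream_si32(slot, value);
    _mm_sfence();
}
```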


SIMD Floating-Point Exceptions

None.


Protected Mode Exceptions

#GP(0) For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments.

#SS(0) For an illegal address in the SS segment.

#PF(fault-code) For a page fault.

#UD If CPUID.01H:EDX.SSE2[bit 26] = 0.

If the LOCK prefix is used.



Real-Address Mode Exceptions

#GP If any part of the operand lies outside the effective address space from 0 to FFFFH.

#UD If CPUID.01H:EDX.SSE2[bit 26] = 0.

If the LOCK prefix is used.


Virtual-8086 Mode Exceptions

Same exceptions as in real address mode.

#PF(fault-code) For a page fault.


Compatibility Mode Exceptions

Same exceptions as in protected mode.


64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) For a page fault.

#UD If CPUID.01H:EDX.SSE2[bit 26] = 0.

If the LOCK prefix is used.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.


MOVNTPD—Store Packed Double-Precision Floating-Point Values Using Non-Temporal Hint

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

66 0F 2B /r

MOVNTPD m128, xmm1

A

V/V

SSE2

Move packed double-precision values in xmm1 to m128 using non-temporal hint.

VEX.128.66.0F.WIG 2B /r VMOVNTPD m128, xmm1

A

V/V

AVX

Move packed double-precision values in xmm1 to m128 using non-temporal hint.

VEX.256.66.0F.WIG 2B /r VMOVNTPD m256, ymm1

A

V/V

AVX

Move packed double-precision values in ymm1 to m256 using non-temporal hint.

EVEX.128.66.0F.W1 2B /r VMOVNTPD m128, xmm1

B

V/V

AVX512VL AVX512F

Move packed double-precision values in xmm1 to m128 using non-temporal hint.

EVEX.256.66.0F.W1 2B /r VMOVNTPD m256, ymm1

B

V/V

AVX512VL AVX512F

Move packed double-precision values in ymm1 to m256 using non-temporal hint.

EVEX.512.66.0F.W1 2B /r VMOVNTPD m512, zmm1

B

V/V

AVX512F

Move packed double-precision values in zmm1 to m512 using non-temporal hint.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

B

Full Mem

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

Description

Moves the packed double-precision floating-point values in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to prevent caching of the data during the write to memory. The source operand is an XMM register, YMM register or ZMM register, which is assumed to contain packed double-precision floating-point data. The destination operand is a 128-bit, 256-bit or 512-bit memory location. The memory operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte (EVEX.512 encoded version) boundary otherwise a general-protection exception (#GP) will be generated.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see “Caching of Temporal vs. Non-Temporal Data” in Chapter 10 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MOVNTPD instructions if multiple processors might use different memory types to read/write the destination memory locations.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, VEX.L must be 0; otherwise instructions will #UD.


Operation

VMOVNTPD (EVEX encoded versions)

VL = 128, 256, 512
DEST[VL-1:0] ← SRC[VL-1:0]
DEST[MAXVL-1:VL] ← 0



MOVNTPD (Legacy and VEX versions)

DEST ← SRC


Intel C/C++ Compiler Intrinsic Equivalent

MOVNTPD void _mm_stream_pd (double * p, __m128d a);
VMOVNTPD void _mm256_stream_pd (double * p, __m256d a);
VMOVNTPD void _mm512_stream_pd (double * p, __m512d a);
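As a usage sketch of the 128-bit form above (the wrapper is this sketch's own), two doubles can be streamed to a 16-byte-aligned destination; misaligned addresses raise #GP per the description:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Non-temporal store of two doubles via MOVNTPD semantics.
   dst must be 16-byte aligned. _mm_set_pd takes (e1, e0), so
   element 0 of the stored pair is a and element 1 is b. */
static void stream_store_pd(double *dst, double a, double b)
{
    _mm_stream_pd(dst, _mm_set_pd(b, a));
    _mm_sfence();  /* order the WC store, per the fencing guidance */
}
```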


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 1.SSE2; EVEX-encoded instruction, see Exceptions Type E1NF.

#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.


MOVNTPS—Store Packed Single-Precision Floating-Point Values Using Non-Temporal Hint

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

NP 0F 2B /r

MOVNTPS m128, xmm1

A

V/V

SSE

Move packed single-precision values in xmm1 to mem using non-temporal hint.

VEX.128.0F.WIG 2B /r VMOVNTPS m128, xmm1

A

V/V

AVX

Move packed single-precision values in xmm1 to mem using non-temporal hint.

VEX.256.0F.WIG 2B /r VMOVNTPS m256, ymm1

A

V/V

AVX

Move packed single-precision values in ymm1 to mem using non-temporal hint.

EVEX.128.0F.W0 2B /r VMOVNTPS m128, xmm1

B

V/V

AVX512VL AVX512F

Move packed single-precision values in xmm1 to m128 using non-temporal hint.

EVEX.256.0F.W0 2B /r VMOVNTPS m256, ymm1

B

V/V

AVX512VL AVX512F

Move packed single-precision values in ymm1 to m256 using non-temporal hint.

EVEX.512.0F.W0 2B /r VMOVNTPS m512, zmm1

B

V/V

AVX512F

Move packed single-precision values in zmm1 to m512 using non-temporal hint.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

B

Full Mem

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

Description

Moves the packed single-precision floating-point values in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to prevent caching of the data during the write to memory. The source operand is an XMM register, YMM register or ZMM register, which is assumed to contain packed single-precision floating-point data. The destination operand is a 128-bit, 256-bit or 512-bit memory location. The memory operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte (EVEX.512 encoded version) boundary otherwise a general-protection exception (#GP) will be generated.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see “Caching of Temporal vs. Non-Temporal Data” in Chapter 10 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MOVNTPS instructions if multiple processors might use different memory types to read/write the destination memory locations.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.


Operation

VMOVNTPS (EVEX encoded versions)

VL = 128, 256, 512
DEST[VL-1:0] ← SRC[VL-1:0]
DEST[MAXVL-1:VL] ← 0



MOVNTPS

DEST ← SRC


Intel C/C++ Compiler Intrinsic Equivalent

MOVNTPS void _mm_stream_ps (float * p, __m128 a);
VMOVNTPS void _mm256_stream_ps (float * p, __m256 a);
VMOVNTPS void _mm512_stream_ps (float * p, __m512 a);
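As a brief sketch of the 128-bit form above (the helper is invented for illustration), four floats can be streamed to a 16-byte-aligned destination and fenced:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Fill a 16-byte-aligned float[4] with a broadcast value using a
   non-temporal MOVNTPS store, then fence the WC write. */
static void stream_fill_ps(float *dst, float x)
{
    _mm_stream_ps(dst, _mm_set1_ps(x));
    _mm_sfence();
}
```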


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 1.SSE; EVEX-encoded instruction, see Exceptions Type E1NF.

#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.


MOVNTQ—Store of Quadword Using Non-Temporal Hint

Opcode

Instruction

Op/ En

64-Bit Mode

Compat/ Leg Mode

Description

NP 0F E7 /r

MOVNTQ m64, mm

MR

Valid

Valid

Move quadword from mm to m64 using non- temporal hint.


Instruction Operand Encoding

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

MR

ModRM:r/m (w)

ModRM:reg (r)

NA

NA


Description

Moves the quadword in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to minimize cache pollution during the write to memory. The source operand is an MMX technology register, which is assumed to contain packed integer data (packed bytes, words, or doublewords). The destination operand is a 64-bit memory location.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see “Caching of Temporal vs. Non-Temporal Data” in Chapter 10 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MOVNTQ instructions if multiple processors might use different memory types to read/write the destination memory locations.

This instruction’s operation is the same in non-64-bit modes and 64-bit mode.


Operation

DEST ← SRC;


Intel C/C Compiler Intrinsic Equivalent

MOVNTQ: void _mm_stream_pi( m64 * p, m64 a)


SIMD Floating-Point Exceptions

None.


Other Exceptions

See Table 22-8, “Exception Conditions for Legacy SIMD/MMX Instructions without FP Exception,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.


MOVQ—Move Quadword

Opcode/ Instruction

Op/ En

64/32-bit Mode

CPUID

Feature Flag

Description

NP 0F 6F /r

MOVQ mm, mm/m64

A

V/V

MMX

Move quadword from mm/m64 to mm.

NP 0F 7F /r

MOVQ mm/m64, mm

B

V/V

MMX

Move quadword from mm to mm/m64.

F3 0F 7E /r

MOVQ xmm1, xmm2/m64

A

V/V

SSE2

Move quadword from xmm2/mem64 to xmm1.

VEX.128.F3.0F.WIG 7E /r

VMOVQ xmm1, xmm2/m64

A

V/V

AVX

Move quadword from xmm2 to xmm1.

EVEX.128.F3.0F.W1 7E /r

VMOVQ xmm1, xmm2/m64

C

V/V

AVX512F

Move quadword from xmm2/m64 to xmm1.

66 0F D6 /r

MOVQ xmm2/m64, xmm1

B

V/V

SSE2

Move quadword from xmm1 to xmm2/mem64.

VEX.128.66.0F.WIG D6 /r

VMOVQ xmm1/m64, xmm2

B

V/V

AVX

Move quadword from xmm2 register to xmm1/m64.

EVEX.128.66.0F.W1 D6 /r

VMOVQ xmm1/m64, xmm2

D

V/V

AVX512F

Move quadword from xmm2 register to xmm1/m64.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

B

NA

ModRM:r/m (w)

ModRM:reg (r)

NA

NA

C

Tuple1 Scalar

ModRM:reg (w)

ModRM:r/m (r)

NA

NA

D

Tuple1 Scalar

ModRM:r/m (w)

ModRM:reg (r)

NA

NA


Description

Copies a quadword from the source operand (second operand) to the destination operand (first operand). The source and destination operands can be MMX technology registers, XMM registers, or 64-bit memory locations. This instruction can be used to move a quadword between two MMX technology registers or between an MMX technology register and a 64-bit memory location, or to move data between two XMM registers or between an XMM register and a 64-bit memory location. The instruction cannot be used to transfer data between memory locations.

When the source operand is an XMM register, the low quadword is moved; when the destination operand is an XMM register, the quadword is stored to the low quadword of the register, and the high quadword is cleared to all 0s.

In 64-bit mode and if not encoded using VEX/EVEX, use of the REX prefix in the form of REX.R permits this instruction to access additional registers (XMM8-XMM15).

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

If VMOVQ is encoded with VEX.L= 1, an attempt to execute the instruction encoded with VEX.L= 1 will cause an

#UD exception.



Operation

MOVQ instruction when operating on MMX technology registers and memory locations

DEST ← SRC;

MOVQ instruction when source and destination operands are XMM registers

DEST[63:0] ← SRC[63:0];

DEST[127:64] ← 0000000000000000H;

MOVQ instruction when source operand is XMM register and destination operand is memory location:
DEST ← SRC[63:0];

MOVQ instruction when source operand is memory location and destination operand is XMM register:
DEST[63:0] ← SRC;

DEST[127:64] ← 0000000000000000H;


VMOVQ (VEX.NDS.128.F3.0F 7E) with XMM register source and destination

DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] ← 0


VMOVQ (VEX.128.66.0F D6) with XMM register source and destination

DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] ← 0


VMOVQ (7E - EVEX encoded version) with XMM register source and destination

DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] ← 0


VMOVQ (D6 - EVEX encoded version) with XMM register source and destination

DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] ← 0


VMOVQ (7E) with memory source
DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] ← 0


VMOVQ (7E - EVEX encoded version) with memory source

DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] ← 0


VMOVQ (D6) with memory dest

DEST[63:0] ← SRC2[63:0]


Flags Affected

None.


Intel C/C Compiler Intrinsic Equivalent VMOVQ m128i _mm_loadu_si64( void * s); VMOVQ void _mm_storeu_si64( void * d, m128i s);

MOVQ: __m128i _mm_move_epi64( __m128i a)
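As a sketch of the register-form behavior described above (the helper name is this sketch's own), `_mm_move_epi64` copies the low quadword and clears the high quadword, which can be observed by storing the result to memory:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* MOVQ xmm, xmm semantics: the low quadword is copied and the
   high quadword of the destination is cleared to zero. */
static void movq_reg_semantics(long long lo, long long out[2])
{
    __m128i v = _mm_set_epi64x(-1LL, lo); /* high quadword = all ones */
    __m128i r = _mm_move_epi64(v);        /* low kept, high zeroed */
    _mm_storeu_si128((__m128i *)out, r);
}
```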



SIMD Floating-Point Exceptions

None


Other Exceptions

See Table 22-8, “Exception Conditions for Legacy SIMD/MMX Instructions without FP Exception,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.


MOVQ2DQ—Move Quadword from MMX Technology to XMM Register

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
F3 0F D6 /r | MOVQ2DQ xmm, mm | RM | Valid | Valid | Move quadword from mmx to low quadword of xmm.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA


Description

Moves the quadword from the source operand (second operand) to the low quadword of the destination operand (first operand). The source operand is an MMX technology register and the destination operand is an XMM register.

This instruction causes a transition from x87 FPU to MMX technology operation (that is, the x87 FPU top-of-stack pointer is set to 0 and the x87 FPU tag word is set to all 0s [valid]). If this instruction is executed while an x87 FPU floating-point exception is pending, the exception is handled before the MOVQ2DQ instruction is executed.

In 64-bit mode, use of the REX.R prefix permits this instruction to access additional registers (XMM8-XMM15).


Operation

DEST[63:0] ← SRC[63:0];
DEST[127:64] ← 0000000000000000H;


Intel C/C Compiler Intrinsic Equivalent

MOVQ2DQ: 128i _mm_movpi64_epi64 ( m64 a)


SIMD Floating-Point Exceptions

None.


Protected Mode Exceptions

#NM If CR0.TS[bit 3] = 1.

#UD If CR0.EM[bit 2] = 1.

If CR4.OSFXSR[bit 9] = 0.

If CPUID.01H:EDX.SSE2[bit 26] = 0.

If the LOCK prefix is used.

#MF If there is a pending x87 FPU exception.


Real-Address Mode Exceptions

Same exceptions as in protected mode.


Virtual-8086 Mode Exceptions

Same exceptions as in protected mode.


Compatibility Mode Exceptions

Same exceptions as in protected mode.


64-Bit Mode Exceptions

Same exceptions as in protected mode.


MOVS/MOVSB/MOVSW/MOVSD/MOVSQ—Move Data from String to String


Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
A4 | MOVS m8, m8 | ZO | Valid | Valid | For legacy mode, move byte from address DS:(E)SI to ES:(E)DI. For 64-bit mode, move byte from address (R|E)SI to (R|E)DI.
A5 | MOVS m16, m16 | ZO | Valid | Valid | For legacy mode, move word from address DS:(E)SI to ES:(E)DI. For 64-bit mode, move word from address (R|E)SI to (R|E)DI.
A5 | MOVS m32, m32 | ZO | Valid | Valid | For legacy mode, move dword from address DS:(E)SI to ES:(E)DI. For 64-bit mode, move dword from address (R|E)SI to (R|E)DI.
REX.W + A5 | MOVS m64, m64 | ZO | Valid | N.E. | Move qword from address (R|E)SI to (R|E)DI.
A4 | MOVSB | ZO | Valid | Valid | For legacy mode, move byte from address DS:(E)SI to ES:(E)DI. For 64-bit mode, move byte from address (R|E)SI to (R|E)DI.
A5 | MOVSW | ZO | Valid | Valid | For legacy mode, move word from address DS:(E)SI to ES:(E)DI. For 64-bit mode, move word from address (R|E)SI to (R|E)DI.
A5 | MOVSD | ZO | Valid | Valid | For legacy mode, move dword from address DS:(E)SI to ES:(E)DI. For 64-bit mode, move dword from address (R|E)SI to (R|E)DI.
REX.W + A5 | MOVSQ | ZO | Valid | N.E. | Move qword from address (R|E)SI to (R|E)DI.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
ZO | NA | NA | NA | NA


Description

Moves the byte, word, or doubleword specified with the second operand (source operand) to the location specified with the first operand (destination operand). Both the source and destination operands are located in memory. The address of the source operand is read from the DS:ESI or the DS:SI registers (depending on the address-size attribute of the instruction, 32 or 16, respectively). The address of the destination operand is read from the ES:EDI or the ES:DI registers (again depending on the address-size attribute of the instruction). The DS segment may be overridden with a segment override prefix, but the ES segment cannot be overridden.

At the assembly-code level, two forms of this instruction are allowed: the “explicit-operands” form and the “no-operands” form. The explicit-operands form (specified with the MOVS mnemonic) allows the source and destination operands to be specified explicitly. Here, the source and destination operands should be symbols that indicate the size and location of the source value and the destination, respectively. This explicit-operands form is provided to allow documentation; however, note that the documentation provided by this form can be misleading. That is, the source and destination operand symbols must specify the correct type (size) of the operands (bytes, words, or doublewords), but they do not have to specify the correct location. The locations of the source and destination operands are always specified by the DS:(E)SI and ES:(E)DI registers, which must be loaded correctly before the move string instruction is executed.

The no-operands form provides “short forms” of the byte, word, and doubleword versions of the MOVS instructions. Here also DS:(E)SI and ES:(E)DI are assumed to be the source and destination operands, respectively. The size of the source and destination operands is selected with the mnemonic: MOVSB (byte move), MOVSW (word move), or MOVSD (doubleword move).

After the move operation, the (E)SI and (E)DI registers are incremented or decremented automatically according to the setting of the DF flag in the EFLAGS register. (If the DF flag is 0, the (E)SI and (E)DI registers are incremented; if the DF flag is 1, the (E)SI and (E)DI registers are decremented.) The registers are incremented or decremented by 1 for byte operations, by 2 for word operations, or by 4 for doubleword operations.


NOTE

To improve performance, more recent processors support modifications to the processor’s operation during the string store operations initiated with MOVS and MOVSB. See Section 7.3.9.3 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 for additional information on fast-string operation.

The MOVS, MOVSB, MOVSW, and MOVSD instructions can be preceded by the REP prefix (see “REP/REPE/REPZ/REPNE/REPNZ—Repeat String Operation Prefix” for a description of the REP prefix) for block moves of ECX bytes, words, or doublewords.

In 64-bit mode, the instruction’s default address size is 64 bits; 32-bit address size is supported using the prefix 67H. The 64-bit addresses are specified by RSI and RDI; 32-bit addresses are specified by ESI and EDI. Use of the REX.W prefix promotes doubleword operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.


Operation

DEST ← SRC;

Non-64-bit Mode:

IF (Byte move)
    THEN IF DF = 0
        THEN
            (E)SI ← (E)SI + 1;
            (E)DI ← (E)DI + 1;
        ELSE
            (E)SI ← (E)SI – 1;
            (E)DI ← (E)DI – 1;
        FI;
ELSE IF (Word move)
    THEN IF DF = 0
        THEN
            (E)SI ← (E)SI + 2;
            (E)DI ← (E)DI + 2;
        ELSE
            (E)SI ← (E)SI – 2;
            (E)DI ← (E)DI – 2;
        FI;
ELSE IF (Doubleword move)
    THEN IF DF = 0
        THEN
            (E)SI ← (E)SI + 4;
            (E)DI ← (E)DI + 4;
        ELSE
            (E)SI ← (E)SI – 4;
            (E)DI ← (E)DI – 4;
        FI;
FI;

64-bit Mode:

IF (Byte move)
    THEN IF DF = 0
        THEN
            (R|E)SI ← (R|E)SI + 1;
            (R|E)DI ← (R|E)DI + 1;
        ELSE
            (R|E)SI ← (R|E)SI – 1;
            (R|E)DI ← (R|E)DI – 1;
        FI;
ELSE IF (Word move)
    THEN IF DF = 0
        THEN
            (R|E)SI ← (R|E)SI + 2;
            (R|E)DI ← (R|E)DI + 2;
        ELSE
            (R|E)SI ← (R|E)SI – 2;
            (R|E)DI ← (R|E)DI – 2;
        FI;
ELSE IF (Doubleword move)
    THEN IF DF = 0
        THEN
            (R|E)SI ← (R|E)SI + 4;
            (R|E)DI ← (R|E)DI + 4;
        ELSE
            (R|E)SI ← (R|E)SI – 4;
            (R|E)DI ← (R|E)DI – 4;
        FI;
ELSE IF (Quadword move)
    THEN IF DF = 0
        THEN
            (R|E)SI ← (R|E)SI + 8;
            (R|E)DI ← (R|E)DI + 8;
        ELSE
            (R|E)SI ← (R|E)SI – 8;
            (R|E)DI ← (R|E)DI – 8;
        FI;
FI;


Flags Affected

None


Protected Mode Exceptions

#GP(0) If the destination is located in a non-writable segment.

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register contains a NULL segment selector.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used.


Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If the LOCK prefix is used.



Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If the LOCK prefix is used.


Compatibility Mode Exceptions

Same exceptions as in protected mode.


64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used.


MOVSD—Move or Merge Scalar Double-Precision Floating-Point Value

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F2 0F 10 /r MOVSD xmm1, xmm2 | A | V/V | SSE2 | Move scalar double-precision floating-point value from xmm2 to xmm1 register.
F2 0F 10 /r MOVSD xmm1, m64 | A | V/V | SSE2 | Load scalar double-precision floating-point value from m64 to xmm1 register.
F2 0F 11 /r MOVSD xmm1/m64, xmm2 | C | V/V | SSE2 | Move scalar double-precision floating-point value from xmm2 register to xmm1/m64.
VEX.NDS.LIG.F2.0F.WIG 10 /r VMOVSD xmm1, xmm2, xmm3 | B | V/V | AVX | Merge scalar double-precision floating-point value from xmm2 and xmm3 to xmm1 register.
VEX.LIG.F2.0F.WIG 10 /r VMOVSD xmm1, m64 | D | V/V | AVX | Load scalar double-precision floating-point value from m64 to xmm1 register.
VEX.NDS.LIG.F2.0F.WIG 11 /r VMOVSD xmm1, xmm2, xmm3 | E | V/V | AVX | Merge scalar double-precision floating-point value from xmm2 and xmm3 registers to xmm1.
VEX.LIG.F2.0F.WIG 11 /r VMOVSD m64, xmm1 | C | V/V | AVX | Store scalar double-precision floating-point value from xmm1 register to m64.
EVEX.NDS.LIG.F2.0F.W1 10 /r VMOVSD xmm1 {k1}{z}, xmm2, xmm3 | B | V/V | AVX512F | Merge scalar double-precision floating-point value from xmm2 and xmm3 registers to xmm1 under writemask k1.
EVEX.LIG.F2.0F.W1 10 /r VMOVSD xmm1 {k1}{z}, m64 | F | V/V | AVX512F | Load scalar double-precision floating-point value from m64 to xmm1 register under writemask k1.
EVEX.NDS.LIG.F2.0F.W1 11 /r VMOVSD xmm1 {k1}{z}, xmm2, xmm3 | E | V/V | AVX512F | Merge scalar double-precision floating-point value from xmm2 and xmm3 registers to xmm1 under writemask k1.
EVEX.LIG.F2.0F.W1 11 /r VMOVSD m64 {k1}, xmm1 | G | V/V | AVX512F | Store scalar double-precision floating-point value from xmm1 register to m64 under writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
C | NA | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
D | NA | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
E | NA | ModRM:r/m (w) | vvvv (r) | ModRM:reg (r) | NA
F | Tuple1 Scalar | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
G | Tuple1 Scalar | ModRM:r/m (w) | ModRM:reg (r) | NA | NA



Description

Moves a scalar double-precision floating-point value from the source operand (second operand) to the destination operand (first operand). The source and destination operands can be XMM registers or 64-bit memory locations. This instruction can be used to move a double-precision floating-point value to and from the low quadword of an XMM register and a 64-bit memory location, or to move a double-precision floating-point value between the low quadwords of two XMM registers. The instruction cannot be used to transfer data between memory locations.

Legacy version: When the source and destination operands are XMM registers, bits MAXVL:64 of the destination operand remain unchanged. When the source operand is a memory location and the destination operand is an XMM register, the quadword at bits 127:64 of the destination operand is cleared to all 0s, and bits MAXVL:128 of the destination operand remain unchanged.

VEX and EVEX encoded register-register syntax: Moves a scalar double-precision floating-point value from the second source operand (the third operand) to the low quadword element of the destination operand (the first operand). Bits 127:64 of the destination operand are copied from the first source operand (the second operand). Bits (MAXVL-1:128) of the corresponding destination register are zeroed.

VEX and EVEX encoded memory load syntax: When the source operand is a memory location and the destination operand is an XMM register, bits MAXVL:64 of the destination operand are cleared to all 0s.

EVEX encoded versions: The low quadword of the destination is updated according to the writemask.

Note: For VMOVSD (memory store and load forms), VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise the instruction will #UD.


Operation

VMOVSD (EVEX.LIG.F2.0F.W1 10 /r: VMOVSD xmm1, m64 with support for 32 registers)

IF k1[0] or *no writemask*
    THEN DEST[63:0] ← SRC[63:0]
    ELSE
        IF *merging-masking*    ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE                ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[MAXVL-1:64] ← 0


VMOVSD (EVEX.LIG.F2.0F.W1 11 /r: VMOVSD m64, xmm1 with support for 32 registers)

IF k1[0] or *no writemask*
    THEN DEST[63:0] ← SRC[63:0]
    ELSE *DEST[63:0] remains unchanged*    ; merging-masking
FI;


VMOVSD (EVEX.NDS.LIG.F2.0F 11 /r: VMOVSD xmm1, xmm2, xmm3)

IF k1[0] or *no writemask*
    THEN DEST[63:0] ← SRC2[63:0]
    ELSE
        IF *merging-masking*    ; merging-masking
            THEN *DEST[63:0] remains unchanged*
            ELSE                ; zeroing-masking
                DEST[63:0] ← 0
        FI;
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0



MOVSD (128-bit Legacy SSE version: MOVSD XMM1, XMM2)

DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] (Unmodified)


VMOVSD (VEX.NDS.128.F2.0F 11 /r: VMOVSD xmm1, xmm2, xmm3)

DEST[63:0] ← SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0


VMOVSD (VEX.NDS.128.F2.0F 10 /r: VMOVSD xmm1, xmm2, xmm3)

DEST[63:0] ← SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0


VMOVSD (VEX.NDS.128.F2.0F 10 /r: VMOVSD xmm1, m64)

DEST[63:0] ← SRC[63:0]
DEST[MAXVL-1:64] ← 0


MOVSD/VMOVSD (128-bit versions: MOVSD m64, xmm1 or VMOVSD m64, xmm1)

DEST[63:0] ← SRC[63:0]


MOVSD (128-bit Legacy SSE version: MOVSD XMM1, m64)

DEST[63:0] ← SRC[63:0]
DEST[127:64] ← 0
DEST[MAXVL-1:128] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VMOVSD __m128d _mm_mask_load_sd(__m128d s, __mmask8 k, double * p);
VMOVSD __m128d _mm_maskz_load_sd(__mmask8 k, double * p);
VMOVSD __m128d _mm_mask_move_sd(__m128d sh, __mmask8 k, __m128d sl, __m128d a);
VMOVSD __m128d _mm_maskz_move_sd(__mmask8 k, __m128d s, __m128d a);
VMOVSD void _mm_mask_store_sd(double * p, __mmask8 k, __m128d s);
MOVSD __m128d _mm_load_sd (double *p)
MOVSD void _mm_store_sd (double *p, __m128d a)
MOVSD __m128d _mm_move_sd (__m128d a, __m128d b)


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5; additionally

#UD If VEX.vvvv != 1111B.

EVEX-encoded instruction, see Exceptions Type E10.


MOVSHDUP—Replicate Single FP Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 16 /r MOVSHDUP xmm1, xmm2/m128 | A | V/V | SSE3 | Move odd index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1.
VEX.128.F3.0F.WIG 16 /r VMOVSHDUP xmm1, xmm2/m128 | A | V/V | AVX | Move odd index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1.
VEX.256.F3.0F.WIG 16 /r VMOVSHDUP ymm1, ymm2/m256 | A | V/V | AVX | Move odd index single-precision floating-point values from ymm2/mem and duplicate each element into ymm1.
EVEX.128.F3.0F.W0 16 /r VMOVSHDUP xmm1 {k1}{z}, xmm2/m128 | B | V/V | AVX512VL AVX512F | Move odd index single-precision floating-point values from xmm2/m128 and duplicate each element into xmm1 under writemask.
EVEX.256.F3.0F.W0 16 /r VMOVSHDUP ymm1 {k1}{z}, ymm2/m256 | B | V/V | AVX512VL AVX512F | Move odd index single-precision floating-point values from ymm2/m256 and duplicate each element into ymm1 under writemask.
EVEX.512.F3.0F.W0 16 /r VMOVSHDUP zmm1 {k1}{z}, zmm2/m512 | B | V/V | AVX512F | Move odd index single-precision floating-point values from zmm2/m512 and duplicate each element into zmm1 under writemask.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
B | Full Mem | ModRM:reg (w) | ModRM:r/m (r) | NA | NA


Description

Duplicates odd-indexed single-precision floating-point values from the source operand (the second operand) to adjacent element pair in the destination operand (the first operand). See Figure 4-3. The source operand is an XMM, YMM or ZMM register or 128, 256 or 512-bit memory location and the destination operand is an XMM, YMM or ZMM register.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.

VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed.

VEX.256 encoded version: Bits (MAXVL-1:256) of the destination register are zeroed.

EVEX encoded version: The destination operand is updated at 32-bit granularity according to the writemask.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.


[Figure: SRC = X7 X6 X5 X4 X3 X2 X1 X0 → DEST = X7 X7 X5 X5 X3 X3 X1 X1]


Figure 4-3. MOVSHDUP Operation



Operation

VMOVSHDUP (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)
TMP_SRC[31:0] ← SRC[63:32]
TMP_SRC[63:32] ← SRC[63:32]
TMP_SRC[95:64] ← SRC[127:96]
TMP_SRC[127:96] ← SRC[127:96]
IF VL >= 256
    TMP_SRC[159:128] ← SRC[191:160]
    TMP_SRC[191:160] ← SRC[191:160]
    TMP_SRC[223:192] ← SRC[255:224]
    TMP_SRC[255:224] ← SRC[255:224]
FI;
IF VL >= 512
    TMP_SRC[287:256] ← SRC[319:288]
    TMP_SRC[319:288] ← SRC[319:288]
    TMP_SRC[351:320] ← SRC[383:352]
    TMP_SRC[383:352] ← SRC[383:352]
    TMP_SRC[415:384] ← SRC[447:416]
    TMP_SRC[447:416] ← SRC[447:416]
    TMP_SRC[479:448] ← SRC[511:480]
    TMP_SRC[511:480] ← SRC[511:480]
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_SRC[i+31:i]
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE                ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VMOVSHDUP (VEX.256 encoded version)

DEST[31:0] ← SRC[63:32]
DEST[63:32] ← SRC[63:32]
DEST[95:64] ← SRC[127:96]
DEST[127:96] ← SRC[127:96]
DEST[159:128] ← SRC[191:160]
DEST[191:160] ← SRC[191:160]
DEST[223:192] ← SRC[255:224]
DEST[255:224] ← SRC[255:224]
DEST[MAXVL-1:256] ← 0


VMOVSHDUP (VEX.128 encoded version)

DEST[31:0] ← SRC[63:32]
DEST[63:32] ← SRC[63:32]
DEST[95:64] ← SRC[127:96]
DEST[127:96] ← SRC[127:96]
DEST[MAXVL-1:128] ← 0


MOVSHDUP (128-bit Legacy SSE version)

DEST[31:0] ← SRC[63:32]
DEST[63:32] ← SRC[63:32]
DEST[95:64] ← SRC[127:96]
DEST[127:96] ← SRC[127:96]
DEST[MAXVL-1:128] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VMOVSHDUP __m512 _mm512_movehdup_ps( __m512 a);
VMOVSHDUP __m512 _mm512_mask_movehdup_ps( __m512 s, __mmask16 k, __m512 a);
VMOVSHDUP __m512 _mm512_maskz_movehdup_ps( __mmask16 k, __m512 a);
VMOVSHDUP __m256 _mm256_mask_movehdup_ps( __m256 s, __mmask8 k, __m256 a);
VMOVSHDUP __m256 _mm256_maskz_movehdup_ps( __mmask8 k, __m256 a);
VMOVSHDUP __m128 _mm_mask_movehdup_ps( __m128 s, __mmask8 k, __m128 a);
VMOVSHDUP __m128 _mm_maskz_movehdup_ps( __mmask8 k, __m128 a);
VMOVSHDUP __m256 _mm256_movehdup_ps ( __m256 a);
VMOVSHDUP __m128 _mm_movehdup_ps ( __m128 a);


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4; EVEX-encoded instruction, see Exceptions Type E4NF.nb.

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.


MOVSLDUP—Replicate Single FP Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 12 /r MOVSLDUP xmm1, xmm2/m128 | A | V/V | SSE3 | Move even index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1.
VEX.128.F3.0F.WIG 12 /r VMOVSLDUP xmm1, xmm2/m128 | A | V/V | AVX | Move even index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1.
VEX.256.F3.0F.WIG 12 /r VMOVSLDUP ymm1, ymm2/m256 | A | V/V | AVX | Move even index single-precision floating-point values from ymm2/mem and duplicate each element into ymm1.
EVEX.128.F3.0F.W0 12 /r VMOVSLDUP xmm1 {k1}{z}, xmm2/m128 | B | V/V | AVX512VL AVX512F | Move even index single-precision floating-point values from xmm2/m128 and duplicate each element into xmm1 under writemask.
EVEX.256.F3.0F.W0 12 /r VMOVSLDUP ymm1 {k1}{z}, ymm2/m256 | B | V/V | AVX512VL AVX512F | Move even index single-precision floating-point values from ymm2/m256 and duplicate each element into ymm1 under writemask.
EVEX.512.F3.0F.W0 12 /r VMOVSLDUP zmm1 {k1}{z}, zmm2/m512 | B | V/V | AVX512F | Move even index single-precision floating-point values from zmm2/m512 and duplicate each element into zmm1 under writemask.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
B | Full Mem | ModRM:reg (w) | ModRM:r/m (r) | NA | NA


Description

Duplicates even-indexed single-precision floating-point values from the source operand (the second operand). See Figure 4-4. The source operand is an XMM, YMM or ZMM register or 128, 256 or 512-bit memory location and the destination operand is an XMM, YMM or ZMM register.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.

VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed.

VEX.256 encoded version: Bits (MAXVL-1:256) of the destination register are zeroed.

EVEX encoded version: The destination operand is updated at 32-bit granularity according to the writemask.

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.


[Figure: SRC = X7 X6 X5 X4 X3 X2 X1 X0 → DEST = X6 X6 X4 X4 X2 X2 X0 X0]


Figure 4-4. MOVSLDUP Operation



Operation

VMOVSLDUP (EVEX encoded versions)

(KL, VL) = (4, 128), (8, 256), (16, 512)
TMP_SRC[31:0] ← SRC[31:0]
TMP_SRC[63:32] ← SRC[31:0]
TMP_SRC[95:64] ← SRC[95:64]
TMP_SRC[127:96] ← SRC[95:64]
IF VL >= 256
    TMP_SRC[159:128] ← SRC[159:128]
    TMP_SRC[191:160] ← SRC[159:128]
    TMP_SRC[223:192] ← SRC[223:192]
    TMP_SRC[255:224] ← SRC[223:192]
FI;
IF VL >= 512
    TMP_SRC[287:256] ← SRC[287:256]
    TMP_SRC[319:288] ← SRC[287:256]
    TMP_SRC[351:320] ← SRC[351:320]
    TMP_SRC[383:352] ← SRC[351:320]
    TMP_SRC[415:384] ← SRC[415:384]
    TMP_SRC[447:416] ← SRC[415:384]
    TMP_SRC[479:448] ← SRC[479:448]
    TMP_SRC[511:480] ← SRC[479:448]
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_SRC[i+31:i]
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE                ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VMOVSLDUP (VEX.256 encoded version)

DEST[31:0] ← SRC[31:0]
DEST[63:32] ← SRC[31:0]
DEST[95:64] ← SRC[95:64]
DEST[127:96] ← SRC[95:64]
DEST[159:128] ← SRC[159:128]
DEST[191:160] ← SRC[159:128]
DEST[223:192] ← SRC[223:192]
DEST[255:224] ← SRC[223:192]
DEST[MAXVL-1:256] ← 0


VMOVSLDUP (VEX.128 encoded version)

DEST[31:0] ← SRC[31:0]
DEST[63:32] ← SRC[31:0]
DEST[95:64] ← SRC[95:64]
DEST[127:96] ← SRC[95:64]
DEST[MAXVL-1:128] ← 0


MOVSLDUP (128-bit Legacy SSE version)

DEST[31:0] ← SRC[31:0]
DEST[63:32] ← SRC[31:0]
DEST[95:64] ← SRC[95:64]
DEST[127:96] ← SRC[95:64]
DEST[MAXVL-1:128] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VMOVSLDUP __m512 _mm512_moveldup_ps( __m512 a);
VMOVSLDUP __m512 _mm512_mask_moveldup_ps( __m512 s, __mmask16 k, __m512 a);
VMOVSLDUP __m512 _mm512_maskz_moveldup_ps( __mmask16 k, __m512 a);
VMOVSLDUP __m256 _mm256_mask_moveldup_ps( __m256 s, __mmask8 k, __m256 a);
VMOVSLDUP __m256 _mm256_maskz_moveldup_ps( __mmask8 k, __m256 a);
VMOVSLDUP __m128 _mm_mask_moveldup_ps( __m128 s, __mmask8 k, __m128 a);
VMOVSLDUP __m128 _mm_maskz_moveldup_ps( __mmask8 k, __m128 a);
VMOVSLDUP __m256 _mm256_moveldup_ps ( __m256 a);
VMOVSLDUP __m128 _mm_moveldup_ps ( __m128 a);


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4; EVEX-encoded instruction, see Exceptions Type E4NF.nb.

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.


MOVSS—Move or Merge Scalar Single-Precision Floating-Point Value

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
F3 0F 10 /r MOVSS xmm1, xmm2 | A | V/V | SSE | Merge scalar single-precision floating-point value from xmm2 to xmm1 register.
F3 0F 10 /r MOVSS xmm1, m32 | A | V/V | SSE | Load scalar single-precision floating-point value from m32 to xmm1 register.
VEX.NDS.LIG.F3.0F.WIG 10 /r VMOVSS xmm1, xmm2, xmm3 | B | V/V | AVX | Merge scalar single-precision floating-point value from xmm2 and xmm3 to xmm1 register.
VEX.LIG.F3.0F.WIG 10 /r VMOVSS xmm1, m32 | D | V/V | AVX | Load scalar single-precision floating-point value from m32 to xmm1 register.
F3 0F 11 /r MOVSS xmm2/m32, xmm1 | C | V/V | SSE | Move scalar single-precision floating-point value from xmm1 register to xmm2/m32.
VEX.NDS.LIG.F3.0F.WIG 11 /r VMOVSS xmm1, xmm2, xmm3 | E | V/V | AVX | Move scalar single-precision floating-point value from xmm2 and xmm3 to xmm1 register.
VEX.LIG.F3.0F.WIG 11 /r VMOVSS m32, xmm1 | C | V/V | AVX | Move scalar single-precision floating-point value from xmm1 register to m32.
EVEX.NDS.LIG.F3.0F.W0 10 /r VMOVSS xmm1 {k1}{z}, xmm2, xmm3 | B | V/V | AVX512F | Move scalar single-precision floating-point value from xmm2 and xmm3 to xmm1 register under writemask k1.
EVEX.LIG.F3.0F.W0 10 /r VMOVSS xmm1 {k1}{z}, m32 | F | V/V | AVX512F | Move scalar single-precision floating-point values from m32 to xmm1 under writemask k1.
EVEX.NDS.LIG.F3.0F.W0 11 /r VMOVSS xmm1 {k1}{z}, xmm2, xmm3 | E | V/V | AVX512F | Move scalar single-precision floating-point value from xmm2 and xmm3 to xmm1 register under writemask k1.
EVEX.LIG.F3.0F.W0 11 /r VMOVSS m32 {k1}, xmm1 | G | V/V | AVX512F | Move scalar single-precision floating-point values from xmm1 to m32 under writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
C | NA | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
D | NA | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
E | NA | ModRM:r/m (w) | vvvv (r) | ModRM:reg (r) | NA
F | Tuple1 Scalar | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
G | Tuple1 Scalar | ModRM:r/m (w) | ModRM:reg (r) | NA | NA



Description

Moves a scalar single-precision floating-point value from the source operand (second operand) to the destination operand (first operand). The source and destination operands can be XMM registers or 32-bit memory locations. This instruction can be used to move a single-precision floating-point value to and from the low doubleword of an XMM register and a 32-bit memory location, or to move a single-precision floating-point value between the low doublewords of two XMM registers. The instruction cannot be used to transfer data between memory locations.

Legacy version: When the source and destination operands are XMM registers, bits (MAXVL-1:32) of the corresponding destination register are unmodified. When the source operand is a memory location and the destination operand is an XMM register, bits 127:32 of the destination operand are cleared to all 0s, and bits MAXVL:128 of the destination operand remain unchanged.

VEX and EVEX encoded register-register syntax: Moves a scalar single-precision floating-point value from the second source operand (the third operand) to the low doubleword element of the destination operand (the first operand). Bits 127:32 of the destination operand are copied from the first source operand (the second operand). Bits (MAXVL-1:128) of the corresponding destination register are zeroed.

VEX and EVEX encoded memory load syntax: When the source operand is a memory location and the destination operand is an XMM register, bits MAXVL:32 of the destination operand are cleared to all 0s.

EVEX encoded versions: The low doubleword of the destination is updated according to the writemask.

Note: For the memory store form instruction “VMOVSS m32, xmm1”, VEX.vvvv is reserved and must be 1111b otherwise instruction will #UD. For the memory store form instruction “VMOVSS mv {k1}, xmm1”, EVEX.vvvv is reserved and must be 1111b otherwise instruction will #UD.

Software should ensure VMOVSS is encoded with VEX.L=0. Encoding VMOVSS with VEX.L=1 may encounter unpredictable behavior across different processor generations.


Operation

VMOVSS (EVEX.LIG.F3.0F.W0 10 /r when the source operand is memory and the destination is an XMM register)

IF k1[0] or *no writemask*
    THEN DEST[31:0] ← SRC[31:0]
    ELSE
        IF *merging-masking*    ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE                ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[MAXVL-1:32] ← 0


VMOVSS (EVEX.LIG.F3.0F.W0 11 /r when the source operand is an XMM register and the destination is memory)

IF k1[0] or *no writemask*
    THEN DEST[31:0] ← SRC[31:0]
    ELSE *DEST[31:0] remains unchanged*    ; merging-masking
FI;



VMOVSS (EVEX.NDS.LIG.F3.0F.W0 10/11 /r where the source and destination are XMM registers)

IF k1[0] or *no writemask*
    THEN DEST[31:0] ← SRC2[31:0]
    ELSE
        IF *merging-masking*    ; merging-masking
            THEN *DEST[31:0] remains unchanged*
            ELSE                ; zeroing-masking
                DEST[31:0] ← 0
        FI;
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0


MOVSS (Legacy SSE version when the source and destination operands are both XMM registers)

DEST[31:0] ← SRC[31:0]
DEST[MAXVL-1:32] (Unmodified)


VMOVSS (VEX.NDS.128.F3.0F 11 /r where the destination is an XMM register)

DEST[31:0] ← SRC2[31:0]
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0


VMOVSS (VEX.NDS.128.F3.0F 10 /r where the source and destination are XMM registers)

DEST[31:0] ← SRC2[31:0]
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0


VMOVSS (VEX.NDS.128.F3.0F 10 /r when the source operand is memory and the destination is an XMM register)

DEST[31:0] ← SRC[31:0]
DEST[MAXVL-1:32] ← 0


MOVSS/VMOVSS (when the source operand is an XMM register and the destination is memory)

DEST[31:0] ← SRC[31:0]


MOVSS (Legacy SSE version when the source operand is memory and the destination is an XMM register)

DEST[31:0] ← SRC[31:0]
DEST[127:32] ← 0
DEST[MAXVL-1:128] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VMOVSS __m128 _mm_mask_load_ss( __m128 s, __mmask8 k, float * p);
VMOVSS __m128 _mm_maskz_load_ss( __mmask8 k, float * p);
VMOVSS __m128 _mm_mask_move_ss( __m128 sh, __mmask8 k, __m128 sl, __m128 a);
VMOVSS __m128 _mm_maskz_move_ss( __mmask8 k, __m128 s, __m128 a);
VMOVSS void _mm_mask_store_ss(float * p, __mmask8 k, __m128 a);
MOVSS __m128 _mm_load_ss(float * p)
MOVSS void _mm_store_ss(float * p, __m128 a)
MOVSS __m128 _mm_move_ss( __m128 a, __m128 b)


SIMD Floating-Point Exceptions

None



Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 5; additionally

#UD If VEX.vvvv != 1111B.

EVEX-encoded instruction, see Exceptions Type E10.


MOVSX/MOVSXD—Move with Sign-Extension

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0F BE /r | MOVSX r16, r/m8 | RM | Valid | Valid | Move byte to word with sign-extension.
0F BE /r | MOVSX r32, r/m8 | RM | Valid | Valid | Move byte to doubleword with sign-extension.
REX.W + 0F BE /r | MOVSX r64, r/m8 | RM | Valid | N.E. | Move byte to quadword with sign-extension.
0F BF /r | MOVSX r32, r/m16 | RM | Valid | Valid | Move word to doubleword, with sign-extension.
REX.W + 0F BF /r | MOVSX r64, r/m16 | RM | Valid | N.E. | Move word to quadword with sign-extension.
63 /r* | MOVSXD r16, r/m16 | RM | Valid | Valid | Move word to word with sign-extension.
63 /r* | MOVSXD r32, r/m32 | RM | Valid | Valid | Move doubleword to doubleword with sign-extension.
REX.W + 63 /r | MOVSXD r64, r/m32 | RM | Valid | N.E. | Move doubleword to quadword with sign-extension.

NOTES:

* The use of MOVSXD without REX.W in 64-bit mode is discouraged. Regular MOV should be used instead of using MOVSXD without REX.W.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA


Description

Copies the contents of the source operand (register or memory location) to the destination operand (register) and sign extends the value to 16, 32, or 64 bits (see Figure 7-6 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1). The size of the converted value depends on the operand-size attribute.

In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.R prefix permits access to additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.
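In C, a cast from a narrower signed type to a wider integer compiles to exactly this instruction; a small sketch (the helper name is hypothetical):

```c
#include <stdint.h>

/* Model of MOVSX r32, r/m8: reinterpret the low byte as signed,
   then widen; compilers emit a single MOVSX for this cast. */
static int32_t movsx_byte_to_dword(uint8_t src) {
    return (int32_t)(int8_t)src;
}
```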


Operation

DEST ← SignExtend(SRC);


Flags Affected

None.


Protected Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register contains a NULL segment selector.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used.



Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If the LOCK prefix is used.


Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#UD If the LOCK prefix is used.


Compatibility Mode Exceptions

Same exceptions as in protected mode.


64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used.


MOVUPD—Move Unaligned Packed Double-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 10 /r MOVUPD xmm1, xmm2/m128 | A | V/V | SSE2 | Move unaligned packed double-precision floating-point values from xmm2/mem to xmm1.
66 0F 11 /r MOVUPD xmm2/m128, xmm1 | B | V/V | SSE2 | Move unaligned packed double-precision floating-point values from xmm1 to xmm2/mem.
VEX.128.66.0F.WIG 10 /r VMOVUPD xmm1, xmm2/m128 | A | V/V | AVX | Move unaligned packed double-precision floating-point values from xmm2/mem to xmm1.
VEX.128.66.0F.WIG 11 /r VMOVUPD xmm2/m128, xmm1 | B | V/V | AVX | Move unaligned packed double-precision floating-point values from xmm1 to xmm2/mem.
VEX.256.66.0F.WIG 10 /r VMOVUPD ymm1, ymm2/m256 | A | V/V | AVX | Move unaligned packed double-precision floating-point values from ymm2/mem to ymm1.
VEX.256.66.0F.WIG 11 /r VMOVUPD ymm2/m256, ymm1 | B | V/V | AVX | Move unaligned packed double-precision floating-point values from ymm1 to ymm2/mem.
EVEX.128.66.0F.W1 10 /r VMOVUPD xmm1 {k1}{z}, xmm2/m128 | C | V/V | AVX512VL AVX512F | Move unaligned packed double-precision floating-point values from xmm2/m128 to xmm1 using writemask k1.
EVEX.128.66.0F.W1 11 /r VMOVUPD xmm2/m128 {k1}{z}, xmm1 | D | V/V | AVX512VL AVX512F | Move unaligned packed double-precision floating-point values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.66.0F.W1 10 /r VMOVUPD ymm1 {k1}{z}, ymm2/m256 | C | V/V | AVX512VL AVX512F | Move unaligned packed double-precision floating-point values from ymm2/m256 to ymm1 using writemask k1.
EVEX.256.66.0F.W1 11 /r VMOVUPD ymm2/m256 {k1}{z}, ymm1 | D | V/V | AVX512VL AVX512F | Move unaligned packed double-precision floating-point values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.66.0F.W1 10 /r VMOVUPD zmm1 {k1}{z}, zmm2/m512 | C | V/V | AVX512F | Move unaligned packed double-precision floating-point values from zmm2/m512 to zmm1 using writemask k1.
EVEX.512.66.0F.W1 11 /r VMOVUPD zmm2/m512 {k1}{z}, zmm1 | D | V/V | AVX512F | Move unaligned packed double-precision floating-point values from zmm1 to zmm2/m512 using writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
C | Full Mem | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
D | Full Mem | ModRM:r/m (w) | ModRM:reg (r) | NA | NA


Description

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

EVEX.512 encoded version:

Moves 512 bits of packed double-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a ZMM register from a 512-bit memory location or to store the contents of a ZMM register into memory. The destination operand is updated according to the writemask.



VEX.256 encoded version:

Moves 256 bits of packed double-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers. Bits (MAXVL-1:256) of the destination register are zeroed.


128-bit versions:

Moves 128 bits of packed double-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.

When the source or destination operand is a memory operand, the operand may be unaligned on a 16-byte boundary without causing a general-protection exception (#GP) to be generated.

VEX.128 and EVEX.128 encoded versions: Bits (MAXVL-1:128) of the destination register are zeroed.
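The unaligned-access guarantee is what portable C code relies on when it copies through `memcpy`; a sketch of the idiom (helper names are hypothetical), which compilers on x86 typically lower to MOVUPD/VMOVUPD:

```c
#include <string.h>

/* Load two doubles from a possibly misaligned address; the memcpy
   is lowered to an unaligned 128-bit load on x86 targets. */
static void loadu_pd(const void *p, double out[2]) {
    memcpy(out, p, 2 * sizeof(double));
}

static int loadu_pd_demo(void) {
    unsigned char buf[1 + 2 * sizeof(double)];
    double v[2] = {1.5, -2.5}, r[2];
    memcpy(buf + 1, v, sizeof v);   /* store at a deliberately odd address */
    loadu_pd(buf + 1, r);
    return r[0] == 1.5 && r[1] == -2.5;
}
```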


Operation

VMOVUPD (EVEX encoded versions, register-copy form)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VMOVUPD (EVEX encoded versions, store-form)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE *DEST[i+63:i] remains unchanged* ; merging-masking
    FI;
ENDFOR;



VMOVUPD (EVEX encoded versions, load-form)

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← SRC[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE DEST[i+63:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VMOVUPD (VEX.256 encoded version, load- and register-copy form)
DEST[255:0] ← SRC[255:0]
DEST[MAXVL-1:256] ← 0


VMOVUPD (VEX.256 encoded version, store-form)

DEST[255:0] ← SRC[255:0]


VMOVUPD (VEX.128 encoded version)
DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] ← 0


MOVUPD (128-bit load- and register-copy- form Legacy SSE version)

DEST[127:0] ← SRC[127:0]

DEST[MAXVL-1:128] (Unmodified)


(V)MOVUPD (128-bit store-form version)

DEST[127:0] ← SRC[127:0]


Intel C/C++ Compiler Intrinsic Equivalent

VMOVUPD __m512d _mm512_loadu_pd(void * s);
VMOVUPD __m512d _mm512_mask_loadu_pd(__m512d a, __mmask8 k, void * s);
VMOVUPD __m512d _mm512_maskz_loadu_pd(__mmask8 k, void * s);
VMOVUPD void _mm512_storeu_pd(void * d, __m512d a);
VMOVUPD void _mm512_mask_storeu_pd(void * d, __mmask8 k, __m512d a);
VMOVUPD __m256d _mm256_mask_loadu_pd(__m256d s, __mmask8 k, void * m);
VMOVUPD __m256d _mm256_maskz_loadu_pd(__mmask8 k, void * m);
VMOVUPD void _mm256_mask_storeu_pd(void * d, __mmask8 k, __m256d a);
VMOVUPD __m128d _mm_mask_loadu_pd(__m128d s, __mmask8 k, void * m);
VMOVUPD __m128d _mm_maskz_loadu_pd(__mmask8 k, void * m);
VMOVUPD void _mm_mask_storeu_pd(void * d, __mmask8 k, __m128d a);
MOVUPD __m256d _mm256_loadu_pd(double * p);
MOVUPD void _mm256_storeu_pd(double * p, __m256d a);
MOVUPD __m128d _mm_loadu_pd(double * p);
MOVUPD void _mm_storeu_pd(double * p, __m128d a);


SIMD Floating-Point Exceptions

None



Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4. Note treatment of #AC varies; additionally

#UD If VEX.vvvv != 1111B.

EVEX-encoded instruction, see Exceptions Type E4.nb.


MOVUPS—Move Unaligned Packed Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 10 /r MOVUPS xmm1, xmm2/m128 | A | V/V | SSE | Move unaligned packed single-precision floating-point values from xmm2/mem to xmm1.
NP 0F 11 /r MOVUPS xmm2/m128, xmm1 | B | V/V | SSE | Move unaligned packed single-precision floating-point values from xmm1 to xmm2/mem.
VEX.128.0F.WIG 10 /r VMOVUPS xmm1, xmm2/m128 | A | V/V | AVX | Move unaligned packed single-precision floating-point values from xmm2/mem to xmm1.
VEX.128.0F.WIG 11 /r VMOVUPS xmm2/m128, xmm1 | B | V/V | AVX | Move unaligned packed single-precision floating-point values from xmm1 to xmm2/mem.
VEX.256.0F.WIG 10 /r VMOVUPS ymm1, ymm2/m256 | A | V/V | AVX | Move unaligned packed single-precision floating-point values from ymm2/mem to ymm1.
VEX.256.0F.WIG 11 /r VMOVUPS ymm2/m256, ymm1 | B | V/V | AVX | Move unaligned packed single-precision floating-point values from ymm1 to ymm2/mem.
EVEX.128.0F.W0 10 /r VMOVUPS xmm1 {k1}{z}, xmm2/m128 | C | V/V | AVX512VL AVX512F | Move unaligned packed single-precision floating-point values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.0F.W0 10 /r VMOVUPS ymm1 {k1}{z}, ymm2/m256 | C | V/V | AVX512VL AVX512F | Move unaligned packed single-precision floating-point values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.0F.W0 10 /r VMOVUPS zmm1 {k1}{z}, zmm2/m512 | C | V/V | AVX512F | Move unaligned packed single-precision floating-point values from zmm2/m512 to zmm1 using writemask k1.
EVEX.128.0F.W0 11 /r VMOVUPS xmm2/m128 {k1}{z}, xmm1 | D | V/V | AVX512VL AVX512F | Move unaligned packed single-precision floating-point values from xmm1 to xmm2/m128 using writemask k1.
EVEX.256.0F.W0 11 /r VMOVUPS ymm2/m256 {k1}{z}, ymm1 | D | V/V | AVX512VL AVX512F | Move unaligned packed single-precision floating-point values from ymm1 to ymm2/m256 using writemask k1.
EVEX.512.0F.W0 11 /r VMOVUPS zmm2/m512 {k1}{z}, zmm1 | D | V/V | AVX512F | Move unaligned packed single-precision floating-point values from zmm1 to zmm2/m512 using writemask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:r/m (w) | ModRM:reg (r) | NA | NA
C | Full Mem | ModRM:reg (w) | ModRM:r/m (r) | NA | NA
D | Full Mem | ModRM:r/m (w) | ModRM:reg (r) | NA | NA


Description

Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

EVEX.512 encoded version:

Moves 512 bits of packed single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a ZMM register from a 512-bit memory location or to store the contents of a ZMM register into memory. The destination operand is updated according to the writemask.



VEX.256 and EVEX.256 encoded versions:

Moves 256 bits of packed single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers. Bits (MAXVL-1:256) of the destination register are zeroed.


128-bit versions:

Moves 128 bits of packed single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.

128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.

When the source or destination operand is a memory operand, the operand may be unaligned without causing a general-protection exception (#GP) to be generated.

VEX.128 and EVEX.128 encoded versions: Bits (MAXVL-1:128) of the destination register are zeroed.


Operation

VMOVUPS (EVEX encoded versions, register-copy form)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VMOVUPS (EVEX encoded versions, store-form)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
    FI;
ENDFOR;



VMOVUPS (EVEX encoded versions, load-form)

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← SRC[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE DEST[i+31:i] ← 0 ; zeroing-masking
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0
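The per-element writemask behavior of the EVEX load form can be modeled in plain C (a sketch with hypothetical helper names; zeroing-masking shown):

```c
#include <stdint.h>

/* Scalar model of VMOVUPS zmm1 {k1}{z}, m512: 16 single-precision
   elements, each passed through or zeroed by its writemask bit. */
static void vmovups_load_maskz(const float src[16], uint16_t k1,
                               float dst[16]) {
    for (int j = 0; j < 16; j++)
        dst[j] = ((k1 >> j) & 1) ? src[j] : 0.0f;   /* zeroing-masking */
}

static int vmovups_demo(void) {
    float src[16], dst[16];
    for (int j = 0; j < 16; j++) src[j] = (float)(j + 1);
    vmovups_load_maskz(src, 0x0003, dst);   /* only elements 0 and 1 pass */
    return dst[0] == 1.0f && dst[1] == 2.0f
        && dst[2] == 0.0f && dst[15] == 0.0f;
}
```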


VMOVUPS (VEX.256 encoded version, load - and register copy)

DEST[255:0] ← SRC[255:0]
DEST[MAXVL-1:256] ← 0


VMOVUPS (VEX.256 encoded version, store-form)

DEST[255:0] ← SRC[255:0]


VMOVUPS (VEX.128 encoded version)
DEST[127:0] ← SRC[127:0]
DEST[MAXVL-1:128] ← 0


MOVUPS (128-bit load- and register-copy- form Legacy SSE version)

DEST[127:0] ← SRC[127:0]

DEST[MAXVL-1:128] (Unmodified)


(V)MOVUPS (128-bit store-form version)

DEST[127:0] ← SRC[127:0]


Intel C/C++ Compiler Intrinsic Equivalent

VMOVUPS __m512 _mm512_loadu_ps(void * s);
VMOVUPS __m512 _mm512_mask_loadu_ps(__m512 a, __mmask16 k, void * s);
VMOVUPS __m512 _mm512_maskz_loadu_ps(__mmask16 k, void * s);
VMOVUPS void _mm512_storeu_ps(void * d, __m512 a);
VMOVUPS void _mm512_mask_storeu_ps(void * d, __mmask16 k, __m512 a);
VMOVUPS __m256 _mm256_mask_loadu_ps(__m256 a, __mmask8 k, void * s);
VMOVUPS __m256 _mm256_maskz_loadu_ps(__mmask8 k, void * s);
VMOVUPS void _mm256_mask_storeu_ps(void * d, __mmask8 k, __m256 a);
VMOVUPS __m128 _mm_mask_loadu_ps(__m128 a, __mmask8 k, void * s);
VMOVUPS __m128 _mm_maskz_loadu_ps(__mmask8 k, void * s);
VMOVUPS void _mm_mask_storeu_ps(void * d, __mmask8 k, __m128 a);
MOVUPS __m256 _mm256_loadu_ps(float * p);
MOVUPS void _mm256_storeu_ps(float * p, __m256 a);
MOVUPS __m128 _mm_loadu_ps(float * p);
MOVUPS void _mm_storeu_ps(float * p, __m128 a);


SIMD Floating-Point Exceptions

None



Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 4. Note treatment of #AC varies;

EVEX-encoded instruction, see Exceptions Type E4.nb.

#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.


MOVZX—Move with Zero-Extend

Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description
0F B6 /r | MOVZX r16, r/m8 | RM | Valid | Valid | Move byte to word with zero-extension.
0F B6 /r | MOVZX r32, r/m8 | RM | Valid | Valid | Move byte to doubleword, zero-extension.
REX.W + 0F B6 /r | MOVZX r64, r/m8* | RM | Valid | N.E. | Move byte to quadword, zero-extension.
0F B7 /r | MOVZX r32, r/m16 | RM | Valid | Valid | Move word to doubleword, zero-extension.
REX.W + 0F B7 /r | MOVZX r64, r/m16 | RM | Valid | N.E. | Move word to quadword, zero-extension.

NOTES:

* In 64-bit mode, r/m8 cannot be encoded to access the following byte registers if the REX prefix is used: AH, BH, CH, DH.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RM | ModRM:reg (w) | ModRM:r/m (r) | NA | NA


Description

Copies the contents of the source operand (register or memory location) to the destination operand (register) and zero extends the value. The size of the converted value depends on the operand-size attribute.

In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.R prefix permits access to additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.
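As with MOVSX, an unsigned widening cast in C compiles to this instruction; a small sketch (the helper name is hypothetical):

```c
#include <stdint.h>

/* Model of MOVZX r32, r/m8: the byte is widened with zeros;
   compilers emit a single MOVZX for this cast. */
static uint32_t movzx_byte_to_dword(uint8_t src) {
    return (uint32_t)src;
}
```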


Operation

DEST ← ZeroExtend(SRC);


Flags Affected

None.


Protected Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register contains a NULL segment selector.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used.


Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If the LOCK prefix is used.



Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If the LOCK prefix is used.


Compatibility Mode Exceptions

Same exceptions as in protected mode.


64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used.


MPSADBW — Compute Multiple Packed Sums of Absolute Difference

Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
66 0F 3A 42 /r ib MPSADBW xmm1, xmm2/m128, imm8 | RMI | V/V | SSE4_1 | Sums absolute 8-bit integer difference of adjacent groups of 4 byte integers in xmm1 and xmm2/m128 and writes the results in xmm1. Starting offsets within xmm1 and xmm2/m128 are determined by imm8.
VEX.NDS.128.66.0F3A.WIG 42 /r ib VMPSADBW xmm1, xmm2, xmm3/m128, imm8 | RVMI | V/V | AVX | Sums absolute 8-bit integer difference of adjacent groups of 4 byte integers in xmm2 and xmm3/m128 and writes the results in xmm1. Starting offsets within xmm2 and xmm3/m128 are determined by imm8.
VEX.NDS.256.66.0F3A.WIG 42 /r ib VMPSADBW ymm1, ymm2, ymm3/m256, imm8 | RVMI | V/V | AVX2 | Sums absolute 8-bit integer difference of adjacent groups of 4 byte integers in ymm2 and ymm3/m256 and writes the results in ymm1. Starting offsets within ymm2 and ymm3/m256 are determined by imm8.


Instruction Operand Encoding

Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4
RMI | ModRM:reg (r, w) | ModRM:r/m (r) | imm8 | NA
RVMI | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | imm8


Description

(V)MPSADBW calculates packed word results of sum-absolute-difference (SAD) of unsigned bytes from two blocks of 32-bit dword elements, using two select fields in the immediate byte to select the offsets of the two blocks within the first source operand and the second operand. Packed SAD word results are calculated within each 128-bit lane. Each SAD word result is calculated between a stationary block_2 (whose offset within the second source operand is selected by a two bit select control, multiplied by 32 bits) and a sliding block_1 at consecutive byte-granular position within the first source operand. The offset of the first 32-bit block of block_1 is selectable using a one bit select control, multiplied by 32 bits.

128-bit Legacy SSE version: Imm8[1:0]*32 specifies the bit offset of block_2 within the second source operand. Imm[2]*32 specifies the initial bit offset of the block_1 within the first source operand. The first source operand and destination operand are the same. The first source and destination operands are XMM registers. The second source operand is either an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destination register remain unchanged. Bits 7:3 of the immediate byte are ignored.

VEX.128 encoded version: Imm8[1:0]*32 specifies the bit offset of block_2 within the second source operand. Imm[2]*32 specifies the initial bit offset of the block_1 within the first source operand. The first source and destination operands are XMM registers. The second source operand is either an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM register are zeroed. Bits 7:3 of the immediate byte are ignored.

VEX.256 encoded version: The sum-absolute-difference (SAD) operation is repeated 8 times for MPSADBW between the same block_2 (fixed offset within the second source operand) and a variable block_1 (offset is shifted by 8 bits for each SAD operation) in the first source operand. Each 16-bit result of eight SAD operations between block_2 and block_1 is written to the respective word in the lower 128 bits of the destination operand.

Additionally, VMPSADBW performs another eight SAD operations on block_4 of the second source operand and block_3 of the first source operand. (Imm8[4:3]*32 + 128) specifies the bit offset of block_4 within the second source operand. (Imm[5]*32+128) specifies the initial bit offset of the block_3 within the first source operand. Each 16-bit result of eight SAD operations between block_4 and block_3 is written to the respective word in the upper 128 bits of the destination operand.



The first source operand is a YMM register. The second source register can be a YMM register or a 256-bit memory location. The destination operand is a YMM register. Bits 7:6 of the immediate byte are ignored.
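A scalar model of the 128-bit operation makes the sliding-window arithmetic concrete (a sketch, not Intel reference code; helper names are hypothetical):

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of MPSADBW xmm1, xmm2/m128, imm8: eight word results,
   each the SAD of a sliding 4-byte window of s1 against a fixed
   4-byte block of s2. */
static void mpsadbw128(const uint8_t s1[16], const uint8_t s2[16],
                       int imm8, uint16_t dst[8]) {
    int off2 = (imm8 & 3) * 4;          /* block_2 offset: imm8[1:0]*32 bits */
    int off1 = ((imm8 >> 2) & 1) * 4;   /* block_1 start: imm8[2]*32 bits */
    for (int k = 0; k < 8; k++) {       /* one word result per byte position */
        uint16_t sum = 0;
        for (int i = 0; i < 4; i++)
            sum += (uint16_t)abs(s1[off1 + k + i] - s2[off2 + i]);
        dst[k] = sum;
    }
}

static int mpsadbw_demo(void) {
    uint8_t s[16];
    uint16_t d[8];
    for (int i = 0; i < 16; i++) s[i] = (uint8_t)(i + 1);
    mpsadbw128(s, s, 0, d);   /* window k is offset k bytes from the block */
    return d[0] == 0 && d[1] == 4 && d[7] == 28;
}
```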

Note: If VMPSADBW is encoded with VEX.L = 1, an attempt to execute the instruction will cause an #UD exception.

[Figure 4-5. 256-bit VMPSADBW Operation — in each 128-bit lane, a fixed 4-byte block of Src2 (selected by Imm[1:0]*32 in the lower lane and Imm[4:3]*32+128 in the upper lane) is compared against a sliding window of Src1 (starting at Imm[2]*32 and Imm[5]*32+128, respectively); the absolute differences are summed into the corresponding words of the destination.]



Operation

VMPSADBW (VEX.256 encoded version)
BLK2_OFFSET ← imm8[1:0]*32
BLK1_OFFSET ← imm8[2]*32

SRC1_BYTE0 ← SRC1[BLK1_OFFSET+7:BLK1_OFFSET]
SRC1_BYTE1 ← SRC1[BLK1_OFFSET+15:BLK1_OFFSET+8]
SRC1_BYTE2 ← SRC1[BLK1_OFFSET+23:BLK1_OFFSET+16]
SRC1_BYTE3 ← SRC1[BLK1_OFFSET+31:BLK1_OFFSET+24]
SRC1_BYTE4 ← SRC1[BLK1_OFFSET+39:BLK1_OFFSET+32]
SRC1_BYTE5 ← SRC1[BLK1_OFFSET+47:BLK1_OFFSET+40]
SRC1_BYTE6 ← SRC1[BLK1_OFFSET+55:BLK1_OFFSET+48]
SRC1_BYTE7 ← SRC1[BLK1_OFFSET+63:BLK1_OFFSET+56]
SRC1_BYTE8 ← SRC1[BLK1_OFFSET+71:BLK1_OFFSET+64]
SRC1_BYTE9 ← SRC1[BLK1_OFFSET+79:BLK1_OFFSET+72]
SRC1_BYTE10 ← SRC1[BLK1_OFFSET+87:BLK1_OFFSET+80]
SRC2_BYTE0 ← SRC2[BLK2_OFFSET+7:BLK2_OFFSET]
SRC2_BYTE1 ← SRC2[BLK2_OFFSET+15:BLK2_OFFSET+8]
SRC2_BYTE2 ← SRC2[BLK2_OFFSET+23:BLK2_OFFSET+16]
SRC2_BYTE3 ← SRC2[BLK2_OFFSET+31:BLK2_OFFSET+24]

TEMP0 ← ABS(SRC1_BYTE0 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE1 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE2 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE3 - SRC2_BYTE3)
DEST[15:0] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE1 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE2 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE3 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE4 - SRC2_BYTE3)
DEST[31:16] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE2 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE3 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE4 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE5 - SRC2_BYTE3)
DEST[47:32] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE3 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE4 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE5 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE6 - SRC2_BYTE3)
DEST[63:48] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE4 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE5 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE6 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE7 - SRC2_BYTE3)
DEST[79:64] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE5 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE6 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE7 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE8 - SRC2_BYTE3)
DEST[95:80] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE6 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE7 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE8 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE9 - SRC2_BYTE3)
DEST[111:96] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE7 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE8 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE9 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE10 - SRC2_BYTE3)
DEST[127:112] ← TEMP0 + TEMP1 + TEMP2 + TEMP3


BLK2_OFFSET ← imm8[4:3]*32 + 128
BLK1_OFFSET ← imm8[5]*32 + 128

SRC1_BYTE0 ← SRC1[BLK1_OFFSET+7:BLK1_OFFSET]
SRC1_BYTE1 ← SRC1[BLK1_OFFSET+15:BLK1_OFFSET+8]
SRC1_BYTE2 ← SRC1[BLK1_OFFSET+23:BLK1_OFFSET+16]
SRC1_BYTE3 ← SRC1[BLK1_OFFSET+31:BLK1_OFFSET+24]
SRC1_BYTE4 ← SRC1[BLK1_OFFSET+39:BLK1_OFFSET+32]
SRC1_BYTE5 ← SRC1[BLK1_OFFSET+47:BLK1_OFFSET+40]
SRC1_BYTE6 ← SRC1[BLK1_OFFSET+55:BLK1_OFFSET+48]
SRC1_BYTE7 ← SRC1[BLK1_OFFSET+63:BLK1_OFFSET+56]
SRC1_BYTE8 ← SRC1[BLK1_OFFSET+71:BLK1_OFFSET+64]
SRC1_BYTE9 ← SRC1[BLK1_OFFSET+79:BLK1_OFFSET+72]
SRC1_BYTE10 ← SRC1[BLK1_OFFSET+87:BLK1_OFFSET+80]
SRC2_BYTE0 ← SRC2[BLK2_OFFSET+7:BLK2_OFFSET]
SRC2_BYTE1 ← SRC2[BLK2_OFFSET+15:BLK2_OFFSET+8]
SRC2_BYTE2 ← SRC2[BLK2_OFFSET+23:BLK2_OFFSET+16]
SRC2_BYTE3 ← SRC2[BLK2_OFFSET+31:BLK2_OFFSET+24]

TEMP0 ← ABS(SRC1_BYTE0 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE1 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE2 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE3 - SRC2_BYTE3)
DEST[143:128] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE1 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE2 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE3 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE4 - SRC2_BYTE3)
DEST[159:144] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE2 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE3 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE4 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE5 - SRC2_BYTE3)
DEST[175:160] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE3 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE4 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE5 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE6 - SRC2_BYTE3)
DEST[191:176] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE4 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE5 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE6 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE7 - SRC2_BYTE3)
DEST[207:192] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE5 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE6 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE7 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE8 - SRC2_BYTE3)
DEST[223:208] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE6 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE7 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE8 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE9 - SRC2_BYTE3)
DEST[239:224] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE7 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE8 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE9 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE10 - SRC2_BYTE3)
DEST[255:240] ← TEMP0 + TEMP1 + TEMP2 + TEMP3


VMPSADBW (VEX.128 encoded version)
BLK2_OFFSET ← imm8[1:0]*32
BLK1_OFFSET ← imm8[2]*32

SRC1_BYTE0 ← SRC1[BLK1_OFFSET+7:BLK1_OFFSET]
SRC1_BYTE1 ← SRC1[BLK1_OFFSET+15:BLK1_OFFSET+8]
SRC1_BYTE2 ← SRC1[BLK1_OFFSET+23:BLK1_OFFSET+16]
SRC1_BYTE3 ← SRC1[BLK1_OFFSET+31:BLK1_OFFSET+24]
SRC1_BYTE4 ← SRC1[BLK1_OFFSET+39:BLK1_OFFSET+32]
SRC1_BYTE5 ← SRC1[BLK1_OFFSET+47:BLK1_OFFSET+40]
SRC1_BYTE6 ← SRC1[BLK1_OFFSET+55:BLK1_OFFSET+48]
SRC1_BYTE7 ← SRC1[BLK1_OFFSET+63:BLK1_OFFSET+56]
SRC1_BYTE8 ← SRC1[BLK1_OFFSET+71:BLK1_OFFSET+64]
SRC1_BYTE9 ← SRC1[BLK1_OFFSET+79:BLK1_OFFSET+72]
SRC1_BYTE10 ← SRC1[BLK1_OFFSET+87:BLK1_OFFSET+80]
SRC2_BYTE0 ← SRC2[BLK2_OFFSET+7:BLK2_OFFSET]
SRC2_BYTE1 ← SRC2[BLK2_OFFSET+15:BLK2_OFFSET+8]
SRC2_BYTE2 ← SRC2[BLK2_OFFSET+23:BLK2_OFFSET+16]
SRC2_BYTE3 ← SRC2[BLK2_OFFSET+31:BLK2_OFFSET+24]

TEMP0 ← ABS(SRC1_BYTE0 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE1 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE2 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE3 - SRC2_BYTE3)
DEST[15:0] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE1 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE2 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE3 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE4 - SRC2_BYTE3)
DEST[31:16] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE2 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE3 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE4 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE5 - SRC2_BYTE3)
DEST[47:32] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE3 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE4 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE5 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE6 - SRC2_BYTE3)
DEST[63:48] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE4 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE5 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE6 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE7 - SRC2_BYTE3)
DEST[79:64] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE5 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE6 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE7 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE8 - SRC2_BYTE3)
DEST[95:80] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE6 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE7 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE8 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE9 - SRC2_BYTE3)
DEST[111:96] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(SRC1_BYTE7 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE8 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE9 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE10 - SRC2_BYTE3)
DEST[127:112] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
DEST[MAXVL-1:128] ← 0



MPSADBW (128-bit Legacy SSE version)
SRC_OFFSET ← imm8[1:0]*32
DEST_OFFSET ← imm8[2]*32

DEST_BYTE0 ← DEST[DEST_OFFSET+7:DEST_OFFSET]
DEST_BYTE1 ← DEST[DEST_OFFSET+15:DEST_OFFSET+8]
DEST_BYTE2 ← DEST[DEST_OFFSET+23:DEST_OFFSET+16]
DEST_BYTE3 ← DEST[DEST_OFFSET+31:DEST_OFFSET+24]
DEST_BYTE4 ← DEST[DEST_OFFSET+39:DEST_OFFSET+32]
DEST_BYTE5 ← DEST[DEST_OFFSET+47:DEST_OFFSET+40]
DEST_BYTE6 ← DEST[DEST_OFFSET+55:DEST_OFFSET+48]
DEST_BYTE7 ← DEST[DEST_OFFSET+63:DEST_OFFSET+56]
DEST_BYTE8 ← DEST[DEST_OFFSET+71:DEST_OFFSET+64]
DEST_BYTE9 ← DEST[DEST_OFFSET+79:DEST_OFFSET+72]
DEST_BYTE10 ← DEST[DEST_OFFSET+87:DEST_OFFSET+80]
SRC_BYTE0 ← SRC[SRC_OFFSET+7:SRC_OFFSET]
SRC_BYTE1 ← SRC[SRC_OFFSET+15:SRC_OFFSET+8]
SRC_BYTE2 ← SRC[SRC_OFFSET+23:SRC_OFFSET+16]
SRC_BYTE3 ← SRC[SRC_OFFSET+31:SRC_OFFSET+24]

TEMP0 ← ABS(DEST_BYTE0 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE1 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE2 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE3 - SRC_BYTE3)
DEST[15:0] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE1 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE2 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE3 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE4 - SRC_BYTE3)
DEST[31:16] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE2 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE3 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE4 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE5 - SRC_BYTE3)
DEST[47:32] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE3 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE4 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE5 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE6 - SRC_BYTE3)
DEST[63:48] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE4 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE5 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE6 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE7 - SRC_BYTE3)
DEST[79:64] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE5 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE6 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE7 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE8 - SRC_BYTE3)
DEST[95:80] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE6 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE7 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE8 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE9 - SRC_BYTE3)
DEST[111:96] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE7 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE8 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE9 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE10 - SRC_BYTE3)
DEST[127:112] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

DEST[MAXVL-1:128] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

(V)MPSADBW __m128i _mm_mpsadbw_epu8(__m128i s1, __m128i s2, const int mask);
VMPSADBW __m256i _mm256_mpsadbw_epu8(__m256i s1, __m256i s2, const int mask);


Flags Affected

None


Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.L = 1.


MUL—Unsigned Multiply

Opcode

Instruction

Op/ En

64-Bit Mode

Compat/ Leg Mode

Description

F6 /4

MUL r/m8

M

Valid

Valid

Unsigned multiply (AX ← AL ∗ r/m8).

REX + F6 /4

MUL r/m8*

M

Valid

N.E.

Unsigned multiply (AX ← AL ∗ r/m8).

F7 /4

MUL r/m16

M

Valid

Valid

Unsigned multiply (DX:AX ← AX ∗ r/m16).

F7 /4

MUL r/m32

M

Valid

Valid

Unsigned multiply (EDX:EAX ← EAX ∗ r/m32).

REX.W + F7 /4

MUL r/m64

M

Valid

N.E.

Unsigned multiply (RDX:RAX ← RAX ∗ r/m64).

NOTES:

* In 64-bit mode, r/m8 can not be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.


Instruction Operand Encoding

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

M

ModRM:r/m (r)

NA

NA

NA


Description

Performs an unsigned multiplication of the first operand (destination operand) and the second operand (source operand) and stores the result in the destination operand. The destination operand is an implied operand located in register AL, AX or EAX (depending on the size of the operand); the source operand is located in a general-purpose register or a memory location. The action of this instruction and the location of the result depends on the opcode and the operand size as shown in Table 4-9.

The result is stored in register AX, register pair DX:AX, or register pair EDX:EAX (depending on the operand size), with the high-order bits of the product contained in register AH, DX, or EDX, respectively. If the high-order bits of the product are 0, the CF and OF flags are cleared; otherwise, the flags are set.

In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.R prefix permits access to additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits.

See the summary chart at the beginning of this section for encoding data and limits.


Table 4-9. MUL Results

Operand Size

Source 1

Source 2

Destination

Byte

AL

r/m8

AX

Word

AX

r/m16

DX:AX

Doubleword

EAX

r/m32

EDX:EAX

Quadword

RAX

r/m64

RDX:RAX



Operation

IF (Byte operation)
	THEN
		AX ← AL ∗ SRC;
	ELSE (* Word or doubleword operation *)
		IF OperandSize = 16
			THEN
				DX:AX ← AX ∗ SRC;
			ELSE IF OperandSize = 32
				THEN EDX:EAX ← EAX ∗ SRC; FI;
			ELSE (* OperandSize = 64 *)
				RDX:RAX ← RAX ∗ SRC;
		FI;
FI;
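As an illustrative sketch (not part of this manual), the 32-bit case of the operation above can be modeled in C. `mul32_model` is a hypothetical name; it reproduces the EDX:EAX split and the CF/OF rule from "Flags Affected" below.

```c
#include <stdint.h>

/* Sketch of MUL r/m32 semantics: the full 64-bit product of EAX and the
 * source lands in EDX:EAX; CF = OF = 1 iff the high half is nonzero. */
static void mul32_model(uint32_t *eax, uint32_t *edx, int *cf_of, uint32_t src)
{
    uint64_t product = (uint64_t)*eax * src;
    *eax = (uint32_t)product;           /* low half  -> EAX */
    *edx = (uint32_t)(product >> 32);   /* high half -> EDX */
    *cf_of = (*edx != 0);               /* flags per "Flags Affected" */
}
```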


Flags Affected

The OF and CF flags are set to 0 if the upper half of the result is 0; otherwise, they are set to 1. The SF, ZF, AF, and PF flags are undefined.


Protected Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register contains a NULL segment selector.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used.


Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If the LOCK prefix is used.


Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If the LOCK prefix is used.


Compatibility Mode Exceptions

Same exceptions as in protected mode.


64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.


MULPD—Multiply Packed Double-Precision Floating-Point Values

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

66 0F 59 /r

MULPD xmm1, xmm2/m128

A

V/V

SSE2

Multiply packed double-precision floating-point values in xmm2/m128 with xmm1 and store result in xmm1.

VEX.NDS.128.66.0F.WIG 59 /r

VMULPD xmm1,xmm2, xmm3/m128

B

V/V

AVX

Multiply packed double-precision floating-point values in xmm3/m128 with xmm2 and store result in xmm1.

VEX.NDS.256.66.0F.WIG 59 /r

VMULPD ymm1, ymm2, ymm3/m256

B

V/V

AVX

Multiply packed double-precision floating-point values in ymm3/m256 with ymm2 and store result in ymm1.

EVEX.NDS.128.66.0F.W1 59 /r

VMULPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst

C

V/V

AVX512VL AVX512F

Multiply packed double-precision floating-point values from xmm3/m128/m64bcst to xmm2 and store result in xmm1.

EVEX.NDS.256.66.0F.W1 59 /r

VMULPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst

C

V/V

AVX512VL AVX512F

Multiply packed double-precision floating-point values from ymm3/m256/m64bcst to ymm2 and store result in ymm1.

EVEX.NDS.512.66.0F.W1 59 /r

VMULPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}

C

V/V

AVX512F

Multiply packed double-precision floating-point values in zmm3/m512/m64bcst with zmm2 and store result in zmm1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (r, w)

ModRM:r/m (r)

NA

NA

B

NA

ModRM:reg (w)

VEX.vvvv (r)

ModRM:r/m (r)

NA

C

Full

ModRM:reg (w)

EVEX.vvvv (r)

ModRM:r/m (r)

NA


Description

Multiplies the packed double-precision floating-point values from the first source operand by the corresponding values in the second source operand, and stores the packed double-precision floating-point results in the destination operand.

EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register. Bits (MAXVL-1:256) of the corresponding destination ZMM register are zeroed.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the destination YMM register are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.



Operation

VMULPD (EVEX encoded versions)

(KL, VL) = (2, 128), (4, 256), (8, 512)

IF (VL = 512) AND (EVEX.b = 1) AND SRC2 *is a register* THEN

SET_RM(EVEX.RC);

ELSE

SET_RM(MXCSR.RM);

FI;

FOR j ← 0 TO KL-1
	i ← j * 64
	IF k1[j] OR *no writemask* THEN
		IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN
			DEST[i+63:i] ← SRC1[i+63:i] * SRC2[63:0]
		ELSE
			DEST[i+63:i] ← SRC1[i+63:i] * SRC2[i+63:i]
		FI;
	ELSE
		IF *merging-masking* ; merging-masking
			THEN *DEST[i+63:i] remains unchanged*
		ELSE ; zeroing-masking
			DEST[i+63:i] ← 0
		FI
	FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VMULPD (VEX.256 encoded version)
DEST[63:0] ← SRC1[63:0] * SRC2[63:0]
DEST[127:64] ← SRC1[127:64] * SRC2[127:64]
DEST[191:128] ← SRC1[191:128] * SRC2[191:128]
DEST[255:192] ← SRC1[255:192] * SRC2[255:192]
DEST[MAXVL-1:256] ← 0

VMULPD (VEX.128 encoded version)
DEST[63:0] ← SRC1[63:0] * SRC2[63:0]
DEST[127:64] ← SRC1[127:64] * SRC2[127:64]
DEST[MAXVL-1:128] ← 0

MULPD (128-bit Legacy SSE version)
DEST[63:0] ← DEST[63:0] * SRC[63:0]
DEST[127:64] ← DEST[127:64] * SRC[127:64]
DEST[MAXVL-1:128] (Unmodified)



Intel C/C++ Compiler Intrinsic Equivalent

VMULPD __m512d _mm512_mul_pd( __m512d a, __m512d b);

VMULPD __m512d _mm512_mask_mul_pd( __m512d s, __mmask8 k, __m512d a, __m512d b);
VMULPD __m512d _mm512_maskz_mul_pd( __mmask8 k, __m512d a, __m512d b);

VMULPD __m512d _mm512_mul_round_pd( __m512d a, __m512d b, int);

VMULPD __m512d _mm512_mask_mul_round_pd( __m512d s, __mmask8 k, __m512d a, __m512d b, int);
VMULPD __m512d _mm512_maskz_mul_round_pd( __mmask8 k, __m512d a, __m512d b, int);

VMULPD __m256d _mm256_mul_pd ( __m256d a, __m256d b);
MULPD __m128d _mm_mul_pd ( __m128d a, __m128d b);
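As an illustrative sketch (not part of this manual), the writemask behavior of the EVEX operation above can be modeled in scalar C. `vmulpd_model` is a hypothetical helper covering the 512-bit case (KL = 8); `zeroing` selects the {z} semantics, otherwise unselected lanes merge (remain unchanged).

```c
#include <stdint.h>

/* Scalar sketch of EVEX VMULPD masking (hypothetical helper, not an
 * Intel API). k1 is the 8-bit writemask; lanes with k1[j] = 0 are
 * either zeroed ({z}) or left unchanged (merging-masking). */
static void vmulpd_model(double dst[8], const double a[8], const double b[8],
                         uint8_t k1, int zeroing)
{
    for (int j = 0; j < 8; j++) {
        if (k1 & (1u << j))
            dst[j] = a[j] * b[j];
        else if (zeroing)
            dst[j] = 0.0;
        /* else: merging-masking, dst[j] remains unchanged */
    }
}
```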


SIMD Floating-Point Exceptions

Overflow, Underflow, Invalid, Precision, Denormal


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 2. EVEX-encoded instruction, see Exceptions Type E2.


MULPS—Multiply Packed Single-Precision Floating-Point Values

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

NP 0F 59 /r

MULPS xmm1, xmm2/m128

A

V/V

SSE

Multiply packed single-precision floating-point values in xmm2/m128 with xmm1 and store result in xmm1.

VEX.NDS.128.0F.WIG 59 /r

VMULPS xmm1,xmm2, xmm3/m128

B

V/V

AVX

Multiply packed single-precision floating-point values in xmm3/m128 with xmm2 and store result in xmm1.

VEX.NDS.256.0F.WIG 59 /r

VMULPS ymm1, ymm2, ymm3/m256

B

V/V

AVX

Multiply packed single-precision floating-point values in ymm3/m256 with ymm2 and store result in ymm1.

EVEX.NDS.128.0F.W0 59 /r

VMULPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst

C

V/V

AVX512VL AVX512F

Multiply packed single-precision floating-point values from xmm3/m128/m32bcst to xmm2 and store result in xmm1.

EVEX.NDS.256.0F.W0 59 /r

VMULPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst

C

V/V

AVX512VL AVX512F

Multiply packed single-precision floating-point values from ymm3/m256/m32bcst to ymm2 and store result in ymm1.

EVEX.NDS.512.0F.W0 59 /r

VMULPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst {er}

C

V/V

AVX512F

Multiply packed single-precision floating-point values in zmm3/m512/m32bcst with zmm2 and store result in zmm1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (r, w)

ModRM:r/m (r)

NA

NA

B

NA

ModRM:reg (w)

VEX.vvvv (r)

ModRM:r/m (r)

NA

C

Full

ModRM:reg (w)

EVEX.vvvv (r)

ModRM:r/m (r)

NA


Description

Multiplies the packed single-precision floating-point values from the first source operand by the corresponding values in the second source operand, and stores the packed single-precision floating-point results in the destination operand.

EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register. Bits (MAXVL-1:256) of the corresponding destination ZMM register are zeroed.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the destination YMM register are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.



Operation

VMULPS (EVEX encoded version)

(KL, VL) = (4, 128), (8, 256), (16, 512)

IF (VL = 512) AND (EVEX.b = 1) AND SRC2 *is a register* THEN

SET_RM(EVEX.RC);

ELSE

SET_RM(MXCSR.RM);

FI;

FOR j ← 0 TO KL-1
	i ← j * 32
	IF k1[j] OR *no writemask* THEN
		IF (EVEX.b = 1) AND (SRC2 *is memory*) THEN
			DEST[i+31:i] ← SRC1[i+31:i] * SRC2[31:0]
		ELSE
			DEST[i+31:i] ← SRC1[i+31:i] * SRC2[i+31:i]
		FI;
	ELSE
		IF *merging-masking* ; merging-masking
			THEN *DEST[i+31:i] remains unchanged*
		ELSE ; zeroing-masking
			DEST[i+31:i] ← 0
		FI
	FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VMULPS (VEX.256 encoded version)
DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
DEST[63:32] ← SRC1[63:32] * SRC2[63:32]
DEST[95:64] ← SRC1[95:64] * SRC2[95:64]
DEST[127:96] ← SRC1[127:96] * SRC2[127:96]
DEST[159:128] ← SRC1[159:128] * SRC2[159:128]
DEST[191:160] ← SRC1[191:160] * SRC2[191:160]
DEST[223:192] ← SRC1[223:192] * SRC2[223:192]
DEST[255:224] ← SRC1[255:224] * SRC2[255:224]
DEST[MAXVL-1:256] ← 0

VMULPS (VEX.128 encoded version)
DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
DEST[63:32] ← SRC1[63:32] * SRC2[63:32]
DEST[95:64] ← SRC1[95:64] * SRC2[95:64]
DEST[127:96] ← SRC1[127:96] * SRC2[127:96]
DEST[MAXVL-1:128] ← 0

MULPS (128-bit Legacy SSE version)
DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
DEST[63:32] ← SRC1[63:32] * SRC2[63:32]
DEST[95:64] ← SRC1[95:64] * SRC2[95:64]
DEST[127:96] ← SRC1[127:96] * SRC2[127:96]
DEST[MAXVL-1:128] (Unmodified)



Intel C/C++ Compiler Intrinsic Equivalent

VMULPS __m512 _mm512_mul_ps( __m512 a, __m512 b);

VMULPS __m512 _mm512_mask_mul_ps( __m512 s, __mmask16 k, __m512 a, __m512 b);
VMULPS __m512 _mm512_maskz_mul_ps( __mmask16 k, __m512 a, __m512 b);

VMULPS __m512 _mm512_mul_round_ps( __m512 a, __m512 b, int);

VMULPS __m512 _mm512_mask_mul_round_ps( __m512 s, __mmask16 k, __m512 a, __m512 b, int);
VMULPS __m512 _mm512_maskz_mul_round_ps( __mmask16 k, __m512 a, __m512 b, int);

VMULPS __m256 _mm256_mask_mul_ps( __m256 s, __mmask8 k, __m256 a, __m256 b);
VMULPS __m256 _mm256_maskz_mul_ps( __mmask8 k, __m256 a, __m256 b);

VMULPS __m128 _mm_mask_mul_ps( __m128 s, __mmask8 k, __m128 a, __m128 b);
VMULPS __m128 _mm_maskz_mul_ps( __mmask8 k, __m128 a, __m128 b);

VMULPS __m256 _mm256_mul_ps ( __m256 a, __m256 b);
MULPS __m128 _mm_mul_ps ( __m128 a, __m128 b);


SIMD Floating-Point Exceptions

Overflow, Underflow, Invalid, Precision, Denormal


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 2. EVEX-encoded instruction, see Exceptions Type E2.


MULSD—Multiply Scalar Double-Precision Floating-Point Value

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

F2 0F 59 /r

MULSD xmm1,xmm2/m64

A

V/V

SSE2

Multiply the low double-precision floating-point value in xmm2/m64 by low double-precision floating-point value in xmm1.

VEX.NDS.LIG.F2.0F.WIG 59 /r

VMULSD xmm1,xmm2, xmm3/m64

B

V/V

AVX

Multiply the low double-precision floating-point value in xmm3/m64 by low double-precision floating-point value in xmm2.

EVEX.NDS.LIG.F2.0F.W1 59 /r

VMULSD xmm1 {k1}{z}, xmm2, xmm3/m64 {er}

C

V/V

AVX512F

Multiply the low double-precision floating-point value in xmm3/m64 by low double-precision floating-point value in xmm2.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (r, w)

ModRM:r/m (r)

NA

NA

B

NA

ModRM:reg (w)

VEX.vvvv (r)

ModRM:r/m (r)

NA

C

Tuple1 Scalar

ModRM:reg (w)

EVEX.vvvv (r)

ModRM:r/m (r)

NA


Description

Multiplies the low double-precision floating-point value in the second source operand by the low double-precision floating-point value in the first source operand, and stores the double-precision floating-point result in the destination operand. The second source operand can be an XMM register or a 64-bit memory location. The first source operand and the destination operands are XMM registers.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (MAXVL-1:64) of the corresponding destination register remain unchanged.

VEX.128 and EVEX encoded version: The quadword at bits 127:64 of the destination operand is copied from the same bits of the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.

EVEX encoded version: The low quadword element of the destination operand is updated according to the writemask.

Software should ensure VMULSD is encoded with VEX.L=0. Encoding VMULSD with VEX.L=1 may encounter unpredictable behavior across different processor generations.



Operation

VMULSD (EVEX encoded version)

IF (EVEX.b = 1) AND SRC2 *is a register* THEN

SET_RM(EVEX.RC);

ELSE

SET_RM(MXCSR.RM);

FI;

IF k1[0] or *no writemask*
	THEN DEST[63:0] ← SRC1[63:0] * SRC2[63:0]
	ELSE
		IF *merging-masking* ; merging-masking
			THEN *DEST[63:0] remains unchanged*
		ELSE ; zeroing-masking
			DEST[63:0] ← 0
		FI
FI;
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0


VMULSD (VEX.128 encoded version)
DEST[63:0] ← SRC1[63:0] * SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[MAXVL-1:128] ← 0


MULSD (128-bit Legacy SSE version)

DEST[63:0] ← DEST[63:0] * SRC[63:0]

DEST[MAXVL-1:64] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VMULSD __m128d _mm_mask_mul_sd( __m128d s, __mmask8 k, __m128d a, __m128d b);
VMULSD __m128d _mm_maskz_mul_sd( __mmask8 k, __m128d a, __m128d b);

VMULSD __m128d _mm_mul_round_sd( __m128d a, __m128d b, int);

VMULSD __m128d _mm_mask_mul_round_sd( __m128d s, __mmask8 k, __m128d a, __m128d b, int);
VMULSD __m128d _mm_maskz_mul_round_sd( __mmask8 k, __m128d a, __m128d b, int);

MULSD __m128d _mm_mul_sd ( __m128d a, __m128d b)
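As an illustrative sketch (not part of this manual), the difference between the legacy and VEX low-lane semantics described above can be modeled in C; `mulsd_legacy` and `vmulsd_vex` are hypothetical helpers, not Intel APIs.

```c
/* Scalar sketch of the MULSD/VMULSD low-lane semantics (hypothetical
 * helpers, not Intel APIs). Legacy MULSD multiplies in place and leaves
 * the upper quadword of the destination unmodified; VEX-encoded VMULSD
 * writes a fresh destination whose upper quadword comes from SRC1. */
static void mulsd_legacy(double dst[2], const double src[2])
{
    dst[0] = dst[0] * src[0];   /* dst[1] remains unchanged */
}

static void vmulsd_vex(double dst[2], const double s1[2], const double s2[2])
{
    dst[0] = s1[0] * s2[0];
    dst[1] = s1[1];             /* DEST[127:64] copied from SRC1[127:64] */
}
```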


SIMD Floating-Point Exceptions

Overflow, Underflow, Invalid, Precision, Denormal


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 3. EVEX-encoded instruction, see Exceptions Type E3.


MULSS—Multiply Scalar Single-Precision Floating-Point Values

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

F3 0F 59 /r

MULSS xmm1,xmm2/m32

A

V/V

SSE

Multiply the low single-precision floating-point value in xmm2/m32 by the low single-precision floating-point value in xmm1.

VEX.NDS.LIG.F3.0F.WIG 59 /r

VMULSS xmm1,xmm2, xmm3/m32

B

V/V

AVX

Multiply the low single-precision floating-point value in xmm3/m32 by the low single-precision floating-point value in xmm2.

EVEX.NDS.LIG.F3.0F.W0 59 /r

VMULSS xmm1 {k1}{z}, xmm2, xmm3/m32 {er}

C

V/V

AVX512F

Multiply the low single-precision floating-point value in xmm3/m32 by the low single-precision floating-point value in xmm2.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (r, w)

ModRM:r/m (r)

NA

NA

B

NA

ModRM:reg (w)

VEX.vvvv (r)

ModRM:r/m (r)

NA

C

Tuple1 Scalar

ModRM:reg (w)

EVEX.vvvv (r)

ModRM:r/m (r)

NA


Description

Multiplies the low single-precision floating-point value from the second source operand by the low single-precision floating-point value in the first source operand, and stores the single-precision floating-point result in the destination operand. The second source operand can be an XMM register or a 32-bit memory location. The first source operand and the destination operands are XMM registers.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (MAXVL-1:32) of the corresponding YMM destination register remain unchanged.

VEX.128 and EVEX encoded version: The first source operand is an xmm register encoded by VEX.vvvv. The three high-order doublewords of the destination operand are copied from the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.

EVEX encoded version: The low doubleword element of the destination operand is updated according to the writemask.

Software should ensure VMULSS is encoded with VEX.L=0. Encoding VMULSS with VEX.L=1 may encounter unpredictable behavior across different processor generations.



Operation

VMULSS (EVEX encoded version)

IF (EVEX.b = 1) AND SRC2 *is a register* THEN

SET_RM(EVEX.RC);

ELSE

SET_RM(MXCSR.RM);

FI;

IF k1[0] or *no writemask*
	THEN DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
	ELSE
		IF *merging-masking* ; merging-masking
			THEN *DEST[31:0] remains unchanged*
		ELSE ; zeroing-masking
			DEST[31:0] ← 0
		FI
FI;
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0


VMULSS (VEX.128 encoded version)
DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
DEST[127:32] ← SRC1[127:32]
DEST[MAXVL-1:128] ← 0


MULSS (128-bit Legacy SSE version)

DEST[31:0] ← DEST[31:0] * SRC[31:0]

DEST[MAXVL-1:32] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VMULSS __m128 _mm_mask_mul_ss( __m128 s, __mmask8 k, __m128 a, __m128 b);
VMULSS __m128 _mm_maskz_mul_ss( __mmask8 k, __m128 a, __m128 b);

VMULSS __m128 _mm_mul_round_ss( __m128 a, __m128 b, int);

VMULSS __m128 _mm_mask_mul_round_ss( __m128 s, __mmask8 k, __m128 a, __m128 b, int);
VMULSS __m128 _mm_maskz_mul_round_ss( __mmask8 k, __m128 a, __m128 b, int);

MULSS __m128 _mm_mul_ss( __m128 a, __m128 b)


SIMD Floating-Point Exceptions

Underflow, Overflow, Invalid, Precision, Denormal


Other Exceptions

Non-EVEX-encoded instruction, see Exceptions Type 3. EVEX-encoded instruction, see Exceptions Type E3.


MULX — Unsigned Multiply Without Affecting Flags

Opcode/ Instruction

Op/ En

64/32

-bit Mode

CPUID

Feature Flag

Description

VEX.NDD.LZ.F2.0F38.W0 F6 /r

MULX r32a, r32b, r/m32

RVM

V/V

BMI2

Unsigned multiply of r/m32 with EDX without affecting arithmetic flags.

VEX.NDD.LZ.F2.0F38.W1 F6 /r

MULX r64a, r64b, r/m64

RVM

V/N.E.

BMI2

Unsigned multiply of r/m64 with RDX without affecting arithmetic flags.


Instruction Operand Encoding

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

RVM

ModRM:reg (w)

VEX.vvvv (w)

ModRM:r/m (r)

RDX/EDX is implied 64/32 bits source


Description

Performs an unsigned multiplication of the implicit source operand (EDX/RDX) and the specified source operand (the third operand) and stores the low half of the result in the second destination (second operand), the high half of the result in the first destination operand (first operand), without reading or writing the arithmetic flags. This enables efficient programming where the software can interleave add with carry operations and multiplications.

If the first and second operand are identical, it will contain the high half of the multiplication result.

This instruction is not supported in real mode and virtual-8086 mode. The operand size is always 32 bits if not in 64-bit mode. In 64-bit mode operand size 64 requires VEX.W1. VEX.W1 is ignored in non-64-bit modes. An attempt to execute this instruction with VEX.L not equal to 0 will cause #UD.


Operation

// DEST1: ModRM:reg
// DEST2: VEX.vvvv
IF (OperandSize = 32)
	SRC1 ← EDX;
	DEST2 ← (SRC1*SRC2)[31:0];
	DEST1 ← (SRC1*SRC2)[63:32];
ELSE IF (OperandSize = 64)
	SRC1 ← RDX;
	DEST2 ← (SRC1*SRC2)[63:0];
	DEST1 ← (SRC1*SRC2)[127:64];
FI
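As an illustrative sketch (not part of this manual), the 64-bit case of the operation above can be modeled in C using the `unsigned __int128` extension available in GCC and Clang; `mulx64_model` is a hypothetical name.

```c
#include <stdint.h>

/* Sketch of MULX r64 semantics: the full 128-bit product of RDX and the
 * source operand, low half to the second destination, high half to the
 * first. No flags are read or written. Relies on the GCC/Clang
 * unsigned __int128 extension. */
static void mulx64_model(uint64_t rdx, uint64_t src,
                         uint64_t *dest1_hi, uint64_t *dest2_lo)
{
    unsigned __int128 product = (unsigned __int128)rdx * src;
    *dest2_lo = (uint64_t)product;
    *dest1_hi = (uint64_t)(product >> 64);
}
```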


Flags Affected

None


Intel C/C++ Compiler Intrinsic Equivalent

Auto-generated from high-level language when possible.

unsigned int _mulx_u32(unsigned int a, unsigned int b, unsigned int * hi);

unsigned __int64 _mulx_u64(unsigned __int64 a, unsigned __int64 b, unsigned __int64 * hi);


SIMD Floating-Point Exceptions

None



Other Exceptions

See Section 2.5.1, “Exception Conditions for VEX-Encoded GPR Instructions”, Table 2-29; additionally

#UD If VEX.L = 1.


MWAIT—Monitor Wait

Opcode

Instruction

Op/ En

64-Bit Mode

Compat/ Leg Mode

Description

0F 01 C9

MWAIT

ZO

Valid

Valid

A hint that allows the processor to stop instruction execution and enter an implementation-dependent optimized state until occurrence of a class of events.


Instruction Operand Encoding

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

ZO

NA

NA

NA

NA


Description

The MWAIT instruction provides hints to allow the processor to enter an implementation-dependent optimized state. There are two principal targeted usages: address-range monitor and advanced power management. Both usages of MWAIT require the use of the MONITOR instruction.

CPUID.01H:ECX.MONITOR[bit 3] indicates the availability of MONITOR and MWAIT in the processor. When set, MWAIT may be executed only at privilege level 0 (use at any other privilege level results in an invalid-opcode exception). The operating system or system BIOS may disable this instruction by using the IA32_MISC_ENABLE MSR; disabling MWAIT clears the CPUID feature flag and causes execution to generate an invalid-opcode exception.

This instruction’s operation is the same in non-64-bit modes and 64-bit mode.

ECX specifies optional extensions for the MWAIT instruction. EAX may contain hints such as the preferred optimized state the processor should enter. The first processors to implement MWAIT supported only the zero value for EAX and ECX. Later processors allowed setting ECX[0] to enable masked interrupts as break events for MWAIT (see below). Software can use the CPUID instruction to determine the extensions and hints supported by the processor.


MWAIT for Address Range Monitoring

For address-range monitoring, the MWAIT instruction operates with the MONITOR instruction. The two instructions allow the definition of an address at which to wait (MONITOR) and an implementation-dependent-optimized operation to commence at the wait address (MWAIT). The execution of MWAIT is a hint to the processor that it can enter an implementation-dependent-optimized state while waiting for an event or a store operation to the address range armed by MONITOR.

The following cause the processor to exit the implementation-dependent-optimized state: a store to the address range armed by the MONITOR instruction, an NMI or SMI, a debug exception, a machine check exception, the BINIT# signal, the INIT# signal, and the RESET# signal. Other implementation-dependent events may also cause the processor to exit the implementation-dependent-optimized state.

In addition, an external interrupt causes the processor to exit the implementation-dependent-optimized state either (1) if the interrupt would be delivered to software (e.g., as it would be if HLT had been executed instead of MWAIT); or (2) if ECX[0] = 1. Software can execute MWAIT with ECX[0] = 1 only if CPUID.05H:ECX[bit 1] = 1. (Implementation-specific conditions may result in an interrupt causing the processor to exit the implementation- dependent-optimized state even if interrupts are masked and ECX[0] = 0.)

Following exit from the implementation-dependent-optimized state, control passes to the instruction following the MWAIT instruction. A pending interrupt that is not masked (including an NMI or an SMI) may be delivered before execution of that instruction. Unlike the HLT instruction, the MWAIT instruction does not support a restart at the MWAIT instruction following the handling of an SMI.

If the preceding MONITOR instruction did not successfully arm an address range or if the MONITOR instruction has not been executed prior to executing MWAIT, then the processor will not enter the implementation-dependent-optimized state. Execution will resume at the instruction following the MWAIT.



MWAIT for Power Management

MWAIT accepts a hint and optional extension to the processor that it can enter a specified target C state while waiting for an event or a store operation to the address range armed by MONITOR. Support for MWAIT extensions for power management is indicated by CPUID.05H:ECX[bit 0] reporting 1.

EAX and ECX are used to communicate the additional information to the MWAIT instruction, such as the kind of optimized state the processor should enter. ECX specifies optional extensions for the MWAIT instruction. EAX may contain hints such as the preferred optimized state the processor should enter. Implementation-specific conditions may cause a processor to ignore the hint and enter a different optimized state. Future processor implementations may implement several optimized “waiting” states and will select among those states based on the hint argument.

Table 4-10 describes the meaning of ECX and EAX registers for MWAIT extensions.


Table 4-10. MWAIT Extension Register (ECX)

Bits

Description

0

Treat interrupts as break events even if masked (e.g., even if EFLAGS.IF=0). May be set only if CPUID.05H:ECX[bit 1] = 1.

31: 1

Reserved


Table 4-11. MWAIT Hints Register (EAX)

Bits

Description

3 : 0

Sub C-state within a C-state, indicated by bits [7:4]

7 : 4

Target C-state*

Value of 0 means C1; 1 means C2, and so on.
Value of 01111B means C0.

Note: Target C-states for MWAIT extensions are processor-specific C-states, not ACPI C-states.

31: 8

Reserved

Note that if MWAIT is used to enter any of the C-states that are numerically higher than C1, a store to the address range armed by the MONITOR instruction will cause the processor to exit MWAIT only if the store was originated by other processor agents. A store from a non-processor agent might not cause the processor to exit MWAIT in such cases.

For additional details of MWAIT extensions, see Chapter 14, “Power and Thermal Management,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
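As an illustrative sketch (not part of this manual), the EAX hint layout of Table 4-11 can be packed in C; `mwait_hint` is a hypothetical helper that only assembles the bit fields, with the raw field values supplied by the caller.

```c
/* Sketch of packing the MWAIT hints register (EAX) per Table 4-11:
 * bits [7:4] hold the target C-state field (0 = C1, 01111B = C0) and
 * bits [3:0] the sub C-state within it. Bits [31:8] stay reserved (0). */
static unsigned mwait_hint(unsigned cstate_field, unsigned substate_field)
{
    return ((cstate_field & 0xFu) << 4) | (substate_field & 0xFu);
}
```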


Operation

(* MWAIT takes the argument in EAX as a hint extension and is architected to take the argument in ECX as an instruction extension
MWAIT EAX, ECX *)
{
WHILE ( (“Monitor Hardware is in armed state”)) {
	implementation_dependent_optimized_state(EAX, ECX);
}
Set the state of Monitor Hardware as triggered;
}


Intel C/C Compiler Intrinsic Equivalent

MWAIT: void _mm_mwait(unsigned extensions, unsigned hints)



Example

The MONITOR/MWAIT instruction pair must be coded in the same loop because execution of the MWAIT instruction will trigger the monitor hardware. It is not a proper usage to execute MONITOR once and then execute MWAIT in a loop. Setting up MONITOR without executing MWAIT has no adverse effects.

Typically the MONITOR/MWAIT pair is used in a sequence, such as:

EAX = Logical Address(Trigger)
ECX = 0 (* Hints *)
EDX = 0 (* Hints *)

IF ( !trigger_store_happened) {
	MONITOR EAX, ECX, EDX
	IF ( !trigger_store_happened ) {
		MWAIT EAX, ECX
	}
}

The above code sequence makes sure that a triggering store does not happen between the first check of the trigger and the execution of the MONITOR instruction. Without the second check that triggering store would go unnoticed. Typical usage of MONITOR and MWAIT would have the above code sequence within a loop.


Numeric Exceptions

None


Protected Mode Exceptions

#GP(0) If ECX[31:1] ≠ 0.

If ECX[0] = 1 and CPUID.05H:ECX[bit 1] = 0.

#UD If CPUID.01H:ECX.MONITOR[bit 3] = 0.

If current privilege level is not 0.


Real Address Mode Exceptions

#GP If ECX[31:1] ≠ 0.

If ECX[0] = 1 and CPUID.05H:ECX[bit 1] = 0.

#UD If CPUID.01H:ECX.MONITOR[bit 3] = 0.


Virtual 8086 Mode Exceptions

#UD The MWAIT instruction is not recognized in virtual-8086 mode (even if CPUID.01H:ECX.MONITOR[bit 3] = 1).


Compatibility Mode Exceptions

Same exceptions as in protected mode.


64-Bit Mode Exceptions

#GP(0) If RCX[63:1] ≠ 0.

If RCX[0] = 1 and CPUID.05H:ECX[bit 1] = 0.

#UD If the current privilege level is not 0.

If CPUID.01H:ECX.MONITOR[bit 3] = 0.


NEG—Two's Complement Negation


Opcode

Instruction

Op/ En

64-Bit Mode

Compat/ Leg Mode

Description

F6 /3

NEG r/m8

M

Valid

Valid

Two's complement negate r/m8.

REX + F6 /3

NEG r/m8*

M

Valid

N.E.

Two's complement negate r/m8.

F7 /3

NEG r/m16

M

Valid

Valid

Two's complement negate r/m16.

F7 /3

NEG r/m32

M

Valid

Valid

Two's complement negate r/m32.

REX.W + F7 /3

NEG r/m64

M

Valid

N.E.

Two's complement negate r/m64.

NOTES:

* In 64-bit mode, r/m8 can not be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.



Instruction Operand Encoding

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

M

ModRM:r/m (r, w)

NA

NA

NA


Description

Replaces the value of operand (the destination operand) with its two's complement. (This operation is equivalent to subtracting the operand from 0.) The destination operand is located in a general-purpose register or a memory location.

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.

In 64-bit mode, the instruction’s default operation size is 32 bits. Using a REX prefix in the form of REX.R permits access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.


Operation

IF DEST = 0
	THEN CF ← 0;
	ELSE CF ← 1;
FI;
DEST ← [– (DEST)]
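As an illustrative sketch (not part of this manual), the 32-bit case of the operation above can be modeled in C; `neg32_model` is a hypothetical name that reproduces the CF rule from "Flags Affected" below.

```c
#include <stdint.h>

/* Sketch of NEG r/m32 semantics: two's-complement negation, equivalent
 * to subtracting the operand from 0; CF is 0 only when the operand was 0. */
static uint32_t neg32_model(uint32_t x, int *cf)
{
    *cf = (x != 0);
    return (uint32_t)(0u - x);   /* wraps modulo 2^32 */
}
```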


Flags Affected

The CF flag is set to 0 if the source operand is 0; otherwise it is set to 1. The OF, SF, ZF, AF, and PF flags are set according to the result.


Protected Mode Exceptions

#GP(0) If the destination is located in a non-writable segment.

If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register contains a NULL segment selector.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used but the destination is not a memory operand.



Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If the LOCK prefix is used but the destination is not a memory operand.


Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If the LOCK prefix is used but the destination is not a memory operand.


Compatibility Mode Exceptions

Same as for protected mode exceptions.


64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) For a page fault.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used but the destination is not a memory operand.


NOP—No Operation

Opcode

Instruction

Op/ En

64-Bit Mode

Compat/ Leg Mode

Description

NP 90

NOP

ZO

Valid

Valid

One byte no-operation instruction.

NP 0F 1F /0

NOP r/m16

M

Valid

Valid

Multi-byte no-operation instruction.

NP 0F 1F /0

NOP r/m32

M

Valid

Valid

Multi-byte no-operation instruction.


Instruction Operand Encoding

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

ZO

NA

NA

NA

NA

M

ModRM:r/m (r)

NA

NA

NA


Description

This instruction performs no operation. It is a one-byte or multi-byte NOP that takes up space in the instruction stream but does not impact machine context, except for the EIP register.

The multi-byte form of NOP is available on processors with model encoding:


image

1. Regardless of the value of R11, the RF and VM flags are always 0 in RFLAGS after execution of SYSRET. In addition, all reserved bits in RFLAGS retain the fixed values.



Operation

IF (CS.L ≠ 1) or (IA32_EFER.LMA ≠ 1) or (IA32_EFER.SCE ≠ 1)
(* Not in 64-Bit Mode or SYSCALL/SYSRET not enabled in IA32_EFER *)
    THEN #UD; FI;
IF (CPL ≠ 0) THEN #GP(0); FI;
IF (operand size is 64-bit)
    THEN (* Return to 64-Bit Mode *)
        IF (RCX is not canonical) THEN #GP(0); FI;
        RIP ← RCX;
    ELSE (* Return to Compatibility Mode *)
        RIP ← ECX;
FI;
RFLAGS ← (R11 & 3C7FD7H) | 2; (* Clear RF, VM, reserved bits; set bit 1 *)
IF (operand size is 64-bit)
    THEN CS.Selector ← IA32_STAR[63:48]+16;
    ELSE CS.Selector ← IA32_STAR[63:48];
FI;
CS.Selector ← CS.Selector OR 3; (* RPL forced to 3 *)
(* Set rest of CS to a fixed value *)
CS.Base ← 0; (* Flat segment *)
CS.Limit ← FFFFFH; (* With 4-KByte granularity, implies a 4-GByte limit *)
CS.Type ← 11; (* Execute/read code, accessed *)
CS.S ← 1;
CS.DPL ← 3;
CS.P ← 1;
IF (operand size is 64-bit)
    THEN (* Return to 64-Bit Mode *)
        CS.L ← 1; (* 64-bit code segment *)
        CS.D ← 0; (* Required if CS.L = 1 *)
    ELSE (* Return to Compatibility Mode *)
        CS.L ← 0; (* Compatibility mode *)
        CS.D ← 1; (* 32-bit code segment *)
FI;
CS.G ← 1; (* 4-KByte granularity *)
CPL ← 3;
SS.Selector ← (IA32_STAR[63:48]+8) OR 3; (* RPL forced to 3 *)
(* Set rest of SS to a fixed value *)
SS.Base ← 0; (* Flat segment *)
SS.Limit ← FFFFFH; (* With 4-KByte granularity, implies a 4-GByte limit *)
SS.Type ← 3; (* Read/write data, accessed *)
SS.S ← 1;
SS.DPL ← 3;
SS.P ← 1;
SS.B ← 1; (* 32-bit stack segment *)
SS.G ← 1; (* 4-KByte granularity *)
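The RFLAGS restore step in the pseudocode above is plain bit arithmetic and can be sketched in C (a model only, not privileged code): R11 is masked with 3C7FD7H, which clears RF, VM, and all reserved bits, and the always-one bit 1 is forced on.

```c
#include <stdint.h>

/* Model of the SYSRET RFLAGS update: RFLAGS <- (R11 & 3C7FD7H) | 2. */
static uint64_t sysret_rflags(uint64_t r11)
{
    return (r11 & 0x3C7FD7ull) | 0x2ull;
}
```

For example, RF (bit 16) and VM (bit 17) in R11 never survive into RFLAGS because the corresponding mask bits are 0.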


Flags Affected

All.


Protected Mode Exceptions

#UD The SYSRET instruction is not recognized in protected mode.


SYSRET—Return From Fast System Call Vol. 2B 4-673



Real-Address Mode Exceptions

#UD The SYSRET instruction is not recognized in real-address mode.


Virtual-8086 Mode Exceptions

#UD The SYSRET instruction is not recognized in virtual-8086 mode.


Compatibility Mode Exceptions

#UD The SYSRET instruction is not recognized in compatibility mode.


64-Bit Mode Exceptions

#UD If IA32_EFER.SCE = 0.

If the LOCK prefix is used.

#GP(0) If CPL ≠ 0.

If the return is to 64-bit mode and RCX contains a non-canonical address.


TEST—Logical Compare


Opcode

Instruction

Op/ En

64-Bit Mode

Compat/ Leg Mode

Description

A8 ib

TEST AL, imm8

I

Valid

Valid

AND imm8 with AL; set SF, ZF, PF according to result.

A9 iw

TEST AX, imm16

I

Valid

Valid

AND imm16 with AX; set SF, ZF, PF according to result.

A9 id

TEST EAX, imm32

I

Valid

Valid

AND imm32 with EAX; set SF, ZF, PF according to result.

REX.W + A9 id

TEST RAX, imm32

I

Valid

N.E.

AND imm32 sign-extended to 64-bits with RAX; set SF, ZF, PF according to result.

F6 /0 ib

TEST r/m8, imm8

MI

Valid

Valid

AND imm8 with r/m8; set SF, ZF, PF according to result.

REX + F6 /0 ib

TEST r/m8*, imm8

MI

Valid

N.E.

AND imm8 with r/m8; set SF, ZF, PF according to result.

F7 /0 iw

TEST r/m16, imm16

MI

Valid

Valid

AND imm16 with r/m16; set SF, ZF, PF according to result.

F7 /0 id

TEST r/m32, imm32

MI

Valid

Valid

AND imm32 with r/m32; set SF, ZF, PF according to result.

REX.W + F7 /0 id

TEST r/m64, imm32

MI

Valid

N.E.

AND imm32 sign-extended to 64-bits with

r/m64; set SF, ZF, PF according to result.

84 /r

TEST r/m8, r8

MR

Valid

Valid

AND r8 with r/m8; set SF, ZF, PF according to result.

REX + 84 /r

TEST r/m8*, r8*

MR

Valid

N.E.

AND r8 with r/m8; set SF, ZF, PF according to result.

85 /r

TEST r/m16, r16

MR

Valid

Valid

AND r16 with r/m16; set SF, ZF, PF according to result.

85 /r

TEST r/m32, r32

MR

Valid

Valid

AND r32 with r/m32; set SF, ZF, PF according to result.

REX.W + 85 /r

TEST r/m64, r64

MR

Valid

N.E.

AND r64 with r/m64; set SF, ZF, PF according to result.

NOTES:

* In 64-bit mode, r/m8 cannot be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.



Instruction Operand Encoding

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

I

AL/AX/EAX/RAX

imm8/16/32

NA

NA

MI

ModRM:r/m (r)

imm8/16/32

NA

NA

MR

ModRM:r/m (r)

ModRM:reg (r)

NA

NA


Description

Computes the bit-wise logical AND of the first operand (source 1 operand) and the second operand (source 2 operand) and sets the SF, ZF, and PF status flags according to the result. The result is then discarded.

In 64-bit mode, using a REX prefix in the form of REX.R permits access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.
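The flag computation can be sketched in plain C (my own helper names, a model rather than reference code): the AND result exists only to set SF, ZF, and PF. PF reflects even parity of the low byte of the result, matching the BitwiseXNOR reduction in the Operation section.

```c
#include <stdint.h>

/* SF: sign bit of the 32-bit AND result. */
static int test_sf(uint32_t a, uint32_t b) { return ((a & b) >> 31) & 1; }

/* ZF: 1 when the AND result is zero. */
static int test_zf(uint32_t a, uint32_t b) { return (a & b) == 0; }

/* PF: 1 when the low byte of the AND result has an even number of 1 bits. */
static int test_pf(uint32_t a, uint32_t b)
{
    uint32_t lo = (a & b) & 0xFFu;
    int ones = 0;
    for (int i = 0; i < 8; i++)
        ones += (lo >> i) & 1;
    return (ones & 1) == 0;
}
```

The common idiom `TEST reg, reg` followed by `JZ` relies on exactly the ZF rule modeled here.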



Operation

TEMP ← SRC1 AND SRC2;
SF ← MSB(TEMP);
IF TEMP = 0
    THEN ZF ← 1;
    ELSE ZF ← 0;
FI;
PF ← BitwiseXNOR(TEMP[0:7]);
CF ← 0;
OF ← 0;
(* AF is undefined *)


Flags Affected

The OF and CF flags are set to 0. The SF, ZF, and PF flags are set according to the result (see the “Operation” section above). The state of the AF flag is undefined.


Protected Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

If the DS, ES, FS, or GS register contains a NULL segment selector.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used.


Real-Address Mode Exceptions

#GP If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS If a memory operand effective address is outside the SS segment limit.

#UD If the LOCK prefix is used.


Virtual-8086 Mode Exceptions

#GP(0) If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit.

#SS(0) If a memory operand effective address is outside the SS segment limit.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made.

#UD If the LOCK prefix is used.


Compatibility Mode Exceptions

Same exceptions as in protected mode.


64-Bit Mode Exceptions

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#GP(0) If the memory address is in a non-canonical form.

#PF(fault-code) If a page fault occurs.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.

#UD If the LOCK prefix is used.


TZCNT — Count the Number of Trailing Zero Bits

Opcode/ Instruction

Op/ En

64/32

-bit Mode

CPUID

Feature Flag

Description

F3 0F BC /r TZCNT r16, r/m16

A

V/V

BMI1

Count the number of trailing zero bits in r/m16, return result in r16.

F3 0F BC /r TZCNT r32, r/m32

A

V/V

BMI1

Count the number of trailing zero bits in r/m32, return result in r32.

F3 REX.W 0F BC /r TZCNT r64, r/m64

A

V/N.E.

BMI1

Count the number of trailing zero bits in r/m64, return result in r64.


Instruction Operand Encoding

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

A

ModRM:reg (w)

ModRM:r/m (r)

NA

NA


Description

TZCNT counts the number of trailing (least significant) zero bits in the source operand (second operand) and returns the result in the destination operand (first operand). TZCNT is an extension of the BSF instruction. The key difference is that TZCNT returns the operand size when the source operand is zero, whereas BSF leaves the contents of the destination operand undefined in that case. On processors that do not support TZCNT, the instruction byte encoding executes as BSF.
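The semantics above can be sketched directly from the pseudocode, here for a 32-bit operand (plain C, my own function name): trailing zeros are counted from bit 0, a zero source yields the operand size, and CF records that zero-input case.

```c
#include <stdint.h>

/* Model of 32-bit TZCNT: returns DEST and reports CF. */
static unsigned tzcnt32(uint32_t src, int *cf)
{
    unsigned dest = 0;
    while (dest < 32 && ((src >> dest) & 1u) == 0)
        dest++;                 /* advance past each trailing zero bit */
    *cf = (dest == 32);         /* CF <- 1 only for a zero input */
    return dest;
}
```

Unlike BSF, this model is fully defined for a zero input: `tzcnt32(0, &cf)` yields 32 with CF set.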


Operation

temp ← 0
DEST ← 0
DO WHILE ( (temp < OperandSize) and (SRC[temp] = 0) )
    temp ← temp + 1
    DEST ← DEST + 1
OD

IF DEST = OperandSize
    THEN CF ← 1
    ELSE CF ← 0
FI

IF DEST = 0
    THEN ZF ← 1
    ELSE ZF ← 0
FI


Flags Affected

ZF is set to 1 in case of zero output (i.e., the least significant bit of the source is set), and to 0 otherwise. CF is set to 1 if the input was zero, and cleared otherwise. The OF, SF, PF, and AF flags are undefined.


Intel C/C++ Compiler Intrinsic Equivalent

TZCNT: unsigned __int32 _tzcnt_u32(unsigned __int32 src);
TZCNT: unsigned __int64 _tzcnt_u64(unsigned __int64 src);



Protected Mode Exceptions

#GP(0) For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments.

If the DS, ES, FS, or GS register is used to access memory and it contains a null segment selector.

#SS(0) For an illegal address in the SS segment.

#PF (fault-code) For a page fault.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.


Real-Address Mode Exceptions

#GP(0) If any part of the operand lies outside of the effective address space from 0 to 0FFFFH.

#SS(0) For an illegal address in the SS segment.


Virtual 8086 Mode Exceptions

#GP(0) If any part of the operand lies outside of the effective address space from 0 to 0FFFFH.

#SS(0) For an illegal address in the SS segment.

#PF (fault-code) For a page fault.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.


Compatibility Mode Exceptions

Same exceptions as in Protected Mode.


64-Bit Mode Exceptions

#GP(0) If the memory address is in a non-canonical form.

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.

#PF (fault-code) For a page fault.

#AC(0) If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.


UCOMISD—Unordered Compare Scalar Double-Precision Floating-Point Values and Set EFLAGS

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

66 0F 2E /r

UCOMISD xmm1, xmm2/m64

A

V/V

SSE2

Compare low double-precision floating-point values in xmm1 and xmm2/mem64 and set the EFLAGS flags accordingly.

VEX.LIG.66.0F.WIG 2E /r

VUCOMISD xmm1, xmm2/m64

A

V/V

AVX

Compare low double-precision floating-point values in xmm1 and xmm2/mem64 and set the EFLAGS flags accordingly.

EVEX.LIG.66.0F.W1 2E /r

VUCOMISD xmm1, xmm2/m64{sae}

B

V/V

AVX512F

Compare low double-precision floating-point values in xmm1 and xmm2/m64 and set the EFLAGS flags accordingly.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (r)

ModRM:r/m (r)

NA

NA

B

Tuple1 Scalar

ModRM:reg (w)

ModRM:r/m (r)

NA

NA


Description

Performs an unordered compare of the double-precision floating-point values in the low quadwords of operand 1 (first operand) and operand 2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered, greater than, less than, or equal). The OF, SF and AF flags in the EFLAGS register are set to 0. The unordered result is returned if either source operand is a NaN (QNaN or SNaN).

Operand 1 is an XMM register; operand 2 can be an XMM register or a 64-bit memory location.

The UCOMISD instruction differs from the COMISD instruction in that it signals a SIMD floating-point invalid operation exception (#I) only when a source operand is an SNaN. The COMISD instruction signals an invalid numeric exception when a source operand is either an SNaN or a QNaN.

The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated. Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

Software should ensure VUCOMISD is encoded with VEX.L=0. Encoding VUCOMISD with VEX.L=1 may encounter unpredictable behavior across different processor generations.
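The EFLAGS outcome table can be sketched in plain C (a model using `isnan`, not the instruction itself): unordered operands set ZF, PF, and CF all to 1, and the three ordered outcomes match the CASE table in the Operation section.

```c
#include <math.h>

struct ucomi_flags { int zf, pf, cf; };

/* Model of the EFLAGS result of UCOMISD on two doubles. */
static struct ucomi_flags ucomisd_model(double a, double b)
{
    struct ucomi_flags f = {0, 0, 0};
    if (isnan(a) || isnan(b)) { f.zf = 1; f.pf = 1; f.cf = 1; } /* UNORDERED: 111 */
    else if (a > b)           { /* GREATER_THAN: 000 */ }
    else if (a < b)           { f.cf = 1; }                     /* LESS_THAN: 001 */
    else                      { f.zf = 1; }                     /* EQUAL: 100 */
    return f;
}
```

PF = 1 is thus a reliable "at least one operand was NaN" indicator after UCOMISD.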


Operation

(V)UCOMISD (all versions)

RESULT ← UnorderedCompare(DEST[63:0] <> SRC[63:0]) {
(* Set EFLAGS *) CASE (RESULT) OF
    UNORDERED:    ZF,PF,CF ← 111;
    GREATER_THAN: ZF,PF,CF ← 000;
    LESS_THAN:    ZF,PF,CF ← 001;
    EQUAL:        ZF,PF,CF ← 100;
ESAC;
OF, AF, SF ← 0; }



Intel C/C++ Compiler Intrinsic Equivalent

VUCOMISD int _mm_comi_round_sd(__m128d a, __m128d b, int imm, int sae);
UCOMISD int _mm_ucomieq_sd(__m128d a, __m128d b);
UCOMISD int _mm_ucomilt_sd(__m128d a, __m128d b);
UCOMISD int _mm_ucomile_sd(__m128d a, __m128d b);
UCOMISD int _mm_ucomigt_sd(__m128d a, __m128d b);
UCOMISD int _mm_ucomige_sd(__m128d a, __m128d b);
UCOMISD int _mm_ucomineq_sd(__m128d a, __m128d b);


SIMD Floating-Point Exceptions

Invalid (if SNaN operands), Denormal


Other Exceptions

VEX-encoded instructions, see Exceptions Type 3; additionally

#UD If VEX.vvvv != 1111B.

EVEX-encoded instructions, see Exceptions Type E3NF.


UCOMISS—Unordered Compare Scalar Single-Precision Floating-Point Values and Set EFLAGS

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

NP 0F 2E /r

UCOMISS xmm1, xmm2/m32

A

V/V

SSE

Compare low single-precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.

VEX.LIG.0F.WIG 2E /r

VUCOMISS xmm1, xmm2/m32

A

V/V

AVX

Compare low single-precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.

EVEX.LIG.0F.W0 2E /r

VUCOMISS xmm1, xmm2/m32{sae}

B

V/V

AVX512F

Compare low single-precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (r)

ModRM:r/m (r)

NA

NA

B

Tuple1 Scalar

ModRM:reg (w)

ModRM:r/m (r)

NA

NA


Description

Compares the single-precision floating-point values in the low doublewords of operand 1 (first operand) and operand 2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered, greater than, less than, or equal). The OF, SF and AF flags in the EFLAGS register are set to 0. The unordered result is returned if either source operand is a NaN (QNaN or SNaN).

Operand 1 is an XMM register; operand 2 can be an XMM register or a 32-bit memory location.

The UCOMISS instruction differs from the COMISS instruction in that it signals a SIMD floating-point invalid operation exception (#I) only if a source operand is an SNaN. The COMISS instruction signals an invalid numeric exception when a source operand is either a QNaN or SNaN.

The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated. Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.

Software should ensure VUCOMISS is encoded with VEX.L=0. Encoding VUCOMISS with VEX.L=1 may encounter unpredictable behavior across different processor generations.
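One practical consequence of the flag mapping is why compilers pair UCOMISS with the "above/below" conditional jumps. A plain-C sketch (hypothetical helper name): after `UCOMISS b, a` the CF flag is 0 exactly when the compare is ordered and a >= b, so a JAE following the compare implements a NaN-rejecting >= predicate.

```c
#include <math.h>

/* Model of "UCOMISS then JAE": true iff the compare is ordered and a >= b. */
static int float_ge(float a, float b)
{
    if (isnan(a) || isnan(b))
        return 0;       /* unordered sets CF, so JAE is not taken */
    return a >= b;      /* ordered case: CF = 0 iff a >= b */
}
```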


Operation

(V)UCOMISS (all versions)

RESULT ← UnorderedCompare(DEST[31:0] <> SRC[31:0]) {
(* Set EFLAGS *) CASE (RESULT) OF
    UNORDERED:    ZF,PF,CF ← 111;
    GREATER_THAN: ZF,PF,CF ← 000;
    LESS_THAN:    ZF,PF,CF ← 001;
    EQUAL:        ZF,PF,CF ← 100;
ESAC;
OF, AF, SF ← 0; }



Intel C/C++ Compiler Intrinsic Equivalent

VUCOMISS int _mm_comi_round_ss(__m128 a, __m128 b, int imm, int sae);
UCOMISS int _mm_ucomieq_ss(__m128 a, __m128 b);
UCOMISS int _mm_ucomilt_ss(__m128 a, __m128 b);
UCOMISS int _mm_ucomile_ss(__m128 a, __m128 b);
UCOMISS int _mm_ucomigt_ss(__m128 a, __m128 b);
UCOMISS int _mm_ucomige_ss(__m128 a, __m128 b);
UCOMISS int _mm_ucomineq_ss(__m128 a, __m128 b);


SIMD Floating-Point Exceptions

Invalid (if SNaN Operands), Denormal


Other Exceptions

VEX-encoded instructions, see Exceptions Type 3; additionally

#UD If VEX.vvvv != 1111B.

EVEX-encoded instructions, see Exceptions Type E3NF.


UD—Undefined Instruction

Opcode

Instruction

Op/ En

64-Bit Mode

Compat/ Leg Mode

Description

0F FF /r

UD0¹ r32, r/m32

RM

Valid

Valid

Raise invalid opcode exception.

0F B9 /r

UD1 r32, r/m32

RM

Valid

Valid

Raise invalid opcode exception.

0F 0B

UD2

ZO

Valid

Valid

Raise invalid opcode exception.

NOTES:

1. Some older processors decode the UD0 instruction without a ModR/M byte. As a result, those processors would deliver an invalid-opcode exception instead of a fault on instruction fetch when the instruction with a ModR/M byte (and any implied bytes) would cross a page or segment boundary.


Instruction Operand Encoding

Op/En

Operand 1

Operand 2

Operand 3

Operand 4

ZO

NA

NA

NA

NA

RM

ModRM:reg (r)

ModRM:r/m (r)

NA

NA


Description

Generates an invalid opcode exception. This instruction is provided for software testing to explicitly generate an invalid opcode exception. The opcodes for this instruction are reserved for this purpose.

Other than raising the invalid opcode exception, this instruction has no effect on processor state or memory.

Even though it is the execution of the UD instruction that causes the invalid opcode exception, the instruction pointer saved by delivery of the exception references the UD instruction (and not the following instruction).

This instruction’s operation is the same in non-64-bit modes and 64-bit mode.


Operation

#UD (* Generates invalid opcode exception *);


Flags Affected

None.


Exceptions (All Operating Modes)

#UD Raises an invalid opcode exception in all operating modes.


UNPCKHPD—Unpack and Interleave High Packed Double-Precision Floating-Point Values

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

66 0F 15 /r

UNPCKHPD xmm1, xmm2/m128

A

V/V

SSE2

Unpacks and Interleaves double-precision floating-point values from high quadwords of xmm1 and xmm2/m128.

VEX.NDS.128.66.0F.WIG 15 /r

VUNPCKHPD xmm1,xmm2, xmm3/m128

B

V/V

AVX

Unpacks and Interleaves double-precision floating-point values from high quadwords of xmm2 and xmm3/m128.

VEX.NDS.256.66.0F.WIG 15 /r

VUNPCKHPD ymm1,ymm2, ymm3/m256

B

V/V

AVX

Unpacks and Interleaves double-precision floating-point values from high quadwords of ymm2 and ymm3/m256.

EVEX.NDS.128.66.0F.W1 15 /r

VUNPCKHPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst

C

V/V

AVX512VL AVX512F

Unpacks and Interleaves double precision floating-point values from high quadwords of xmm2 and xmm3/m128/m64bcst subject to writemask k1.

EVEX.NDS.256.66.0F.W1 15 /r

VUNPCKHPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst

C

V/V

AVX512VL AVX512F

Unpacks and Interleaves double precision floating-point values from high quadwords of ymm2 and ymm3/m256/m64bcst subject to writemask k1.

EVEX.NDS.512.66.0F.W1 15 /r

VUNPCKHPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst

C

V/V

AVX512F

Unpacks and Interleaves double-precision floating-point values from high quadwords of zmm2 and zmm3/m512/m64bcst subject to writemask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (r, w)

ModRM:r/m (r)

NA

NA

B

NA

ModRM:reg (w)

VEX.vvvv (r)

ModRM:r/m (r)

NA

C

Full

ModRM:reg (w)

EVEX.vvvv (r)

ModRM:r/m (r)

NA


Description

Performs an interleaved unpack of the high double-precision floating-point values from the first source operand and the second source operand. See Figure 4-15 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified. When unpacking from a memory operand, an implementation may fetch only the appropriate 64 bits; however, alignment to a 16-byte boundary and normal segment checking will still be enforced.

VEX.128 encoded version: The first source operand is a XMM register. The second source operand can be a XMM register or a 128-bit memory location. The destination operand is a XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM register, conditionally updated using writemask k1.

EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register, a 256-bit memory location, or a 256-bit vector broadcasted from a 64-bit memory location. The destination operand is a YMM register, conditionally updated using writemask k1.

EVEX.128 encoded version: The first source operand is a XMM register. The second source operand is a XMM register, a 128-bit memory location, or a 128-bit vector broadcasted from a 64-bit memory location. The destination operand is a XMM register, conditionally updated using writemask k1.
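The 128-bit data movement described above can be sketched in plain C (a model, my own function name): each source contributes its high quadword, first source to the low destination element, second source to the high element.

```c
/* Model of 128-bit UNPCKHPD: dest = { src1.high, src2.high }. */
static void unpckhpd_model(const double src1[2], const double src2[2],
                           double dest[2])
{
    dest[0] = src1[1];  /* DEST[63:0]   <- SRC1[127:64] */
    dest[1] = src2[1];  /* DEST[127:64] <- SRC2[127:64] */
}
```

The wider VEX/EVEX forms apply this same pattern independently within each 128-bit lane.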



Operation

VUNPCKHPD (EVEX encoded versions when SRC2 is a register)

(KL, VL) = (2, 128), (4, 256), (8, 512)
IF VL >= 128
    TMP_DEST[63:0] ← SRC1[127:64]
    TMP_DEST[127:64] ← SRC2[127:64]
FI;
IF VL >= 256
    TMP_DEST[191:128] ← SRC1[255:192]
    TMP_DEST[255:192] ← SRC2[255:192]
FI;
IF VL >= 512
    TMP_DEST[319:256] ← SRC1[383:320]
    TMP_DEST[383:320] ← SRC2[383:320]
    TMP_DEST[447:384] ← SRC1[511:448]
    TMP_DEST[511:448] ← SRC2[511:448]
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VUNPCKHPD (EVEX encoded version when SRC2 is memory)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF (EVEX.b = 1)
        THEN TMP_SRC2[i+63:i] ← SRC2[63:0]
        ELSE TMP_SRC2[i+63:i] ← SRC2[i+63:i]
    FI;
ENDFOR;
IF VL >= 128
    TMP_DEST[63:0] ← SRC1[127:64]
    TMP_DEST[127:64] ← TMP_SRC2[127:64]
FI;
IF VL >= 256
    TMP_DEST[191:128] ← SRC1[255:192]
    TMP_DEST[255:192] ← TMP_SRC2[255:192]
FI;
IF VL >= 512
    TMP_DEST[319:256] ← SRC1[383:320]
    TMP_DEST[383:320] ← TMP_SRC2[383:320]
    TMP_DEST[447:384] ← SRC1[511:448]
    TMP_DEST[511:448] ← TMP_SRC2[511:448]
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VUNPCKHPD (VEX.256 encoded version)
DEST[63:0] ← SRC1[127:64]
DEST[127:64] ← SRC2[127:64]
DEST[191:128] ← SRC1[255:192]
DEST[255:192] ← SRC2[255:192]
DEST[MAXVL-1:256] ← 0


VUNPCKHPD (VEX.128 encoded version)
DEST[63:0] ← SRC1[127:64]
DEST[127:64] ← SRC2[127:64]
DEST[MAXVL-1:128] ← 0


UNPCKHPD (128-bit Legacy SSE version)
DEST[63:0] ← SRC1[127:64]
DEST[127:64] ← SRC2[127:64]
DEST[MAXVL-1:128] (Unmodified)



Intel C/C++ Compiler Intrinsic Equivalent

VUNPCKHPD __m512d _mm512_unpackhi_pd(__m512d a, __m512d b);
VUNPCKHPD __m512d _mm512_mask_unpackhi_pd(__m512d s, __mmask8 k, __m512d a, __m512d b);
VUNPCKHPD __m512d _mm512_maskz_unpackhi_pd(__mmask8 k, __m512d a, __m512d b);
VUNPCKHPD __m256d _mm256_unpackhi_pd(__m256d a, __m256d b);
VUNPCKHPD __m256d _mm256_mask_unpackhi_pd(__m256d s, __mmask8 k, __m256d a, __m256d b);
VUNPCKHPD __m256d _mm256_maskz_unpackhi_pd(__mmask8 k, __m256d a, __m256d b);
UNPCKHPD __m128d _mm_unpackhi_pd(__m128d a, __m128d b);
VUNPCKHPD __m128d _mm_mask_unpackhi_pd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VUNPCKHPD __m128d _mm_maskz_unpackhi_pd(__mmask8 k, __m128d a, __m128d b);


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instructions, see Exceptions Type 4. EVEX-encoded instructions, see Exceptions Type E4NF.


UNPCKHPS—Unpack and Interleave High Packed Single-Precision Floating-Point Values

Opcode/ Instruction

Op / En

64/32

bit Mode Support

CPUID

Feature Flag

Description

NP 0F 15 /r

UNPCKHPS xmm1, xmm2/m128

A

V/V

SSE

Unpacks and Interleaves single-precision floating-point values from high quadwords of xmm1 and xmm2/m128.

VEX.NDS.128.0F.WIG 15 /r

VUNPCKHPS xmm1, xmm2, xmm3/m128

B

V/V

AVX

Unpacks and Interleaves single-precision floating-point values from high quadwords of xmm2 and xmm3/m128.

VEX.NDS.256.0F.WIG 15 /r

VUNPCKHPS ymm1, ymm2, ymm3/m256

B

V/V

AVX

Unpacks and Interleaves single-precision floating-point values from high quadwords of ymm2 and ymm3/m256.

EVEX.NDS.128.0F.W0 15 /r

VUNPCKHPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst

C

V/V

AVX512VL AVX512F

Unpacks and Interleaves single-precision floating-point values from high quadwords of xmm2 and xmm3/m128/m32bcst and write result to xmm1 subject to writemask k1.

EVEX.NDS.256.0F.W0 15 /r

VUNPCKHPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst

C

V/V

AVX512VL AVX512F

Unpacks and Interleaves single-precision floating-point values from high quadwords of ymm2 and ymm3/m256/m32bcst and write result to ymm1 subject to writemask k1.

EVEX.NDS.512.0F.W0 15 /r

VUNPCKHPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst

C

V/V

AVX512F

Unpacks and Interleaves single-precision floating-point values from high quadwords of zmm2 and zmm3/m512/m32bcst and write result to zmm1 subject to writemask k1.


Instruction Operand Encoding

Op/En

Tuple Type

Operand 1

Operand 2

Operand 3

Operand 4

A

NA

ModRM:reg (r, w)

ModRM:r/m (r)

NA

NA

B

NA

ModRM:reg (w)

VEX.vvvv (r)

ModRM:r/m (r)

NA

C

Full

ModRM:reg (w)

EVEX.vvvv (r)

ModRM:r/m (r)

NA


Description

Performs an interleaved unpack of the high single-precision floating-point values from the first source operand and the second source operand.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified. When unpacking from a memory operand, an implementation may fetch only the appropriate 64 bits; however, alignment to a 16-byte boundary and normal segment checking will still be enforced.

VEX.128 encoded version: The first source operand is a XMM register. The second source operand can be a XMM register or a 128-bit memory location. The destination operand is a XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.

VEX.256 encoded version: The second source operand is a YMM register or a 256-bit memory location. The first source operand and destination operands are YMM registers.


Figure 4-27. VUNPCKHPS Operation (with SRC1 = X7..X0 and SRC2 = Y7..Y0, DEST receives Y7 X7 Y6 X6 Y3 X3 Y2 X2)



EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM register, conditionally updated using writemask k1.

EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register, a 256-bit memory location, or a 256-bit vector broadcasted from a 32-bit memory location. The destination operand is a YMM register, conditionally updated using writemask k1.

EVEX.128 encoded version: The first source operand is a XMM register. The second source operand is a XMM register, a 128-bit memory location, or a 128-bit vector broadcasted from a 32-bit memory location. The destination operand is a XMM register, conditionally updated using writemask k1.
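The per-lane data movement described above can be sketched in plain C (a model, my own function name): the two high singles of each 128-bit source are interleaved, first-source element before second-source element.

```c
/* Model of 128-bit UNPCKHPS: dest = { s1[2], s2[2], s1[3], s2[3] }. */
static void unpckhps_model(const float src1[4], const float src2[4],
                           float dest[4])
{
    dest[0] = src1[2];  /* DEST[31:0]   <- SRC1[95:64]  */
    dest[1] = src2[2];  /* DEST[63:32]  <- SRC2[95:64]  */
    dest[2] = src1[3];  /* DEST[95:64]  <- SRC1[127:96] */
    dest[3] = src2[3];  /* DEST[127:96] <- SRC2[127:96] */
}
```

With SRC1 = X3..X0 and SRC2 = Y3..Y0 this reproduces the low lane of Figure 4-27: Y3 X3 Y2 X2.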


Operation

VUNPCKHPS (EVEX encoded version when SRC2 is a register)

(KL, VL) = (4, 128), (8, 256), (16, 512)
IF VL >= 128
    TMP_DEST[31:0] ← SRC1[95:64]
    TMP_DEST[63:32] ← SRC2[95:64]
    TMP_DEST[95:64] ← SRC1[127:96]
    TMP_DEST[127:96] ← SRC2[127:96]
FI;
IF VL >= 256
    TMP_DEST[159:128] ← SRC1[223:192]
    TMP_DEST[191:160] ← SRC2[223:192]
    TMP_DEST[223:192] ← SRC1[255:224]
    TMP_DEST[255:224] ← SRC2[255:224]
FI;
IF VL >= 512
    TMP_DEST[287:256] ← SRC1[351:320]
    TMP_DEST[319:288] ← SRC2[351:320]
    TMP_DEST[351:320] ← SRC1[383:352]
    TMP_DEST[383:352] ← SRC2[383:352]
    TMP_DEST[415:384] ← SRC1[479:448]
    TMP_DEST[447:416] ← SRC2[479:448]
    TMP_DEST[479:448] ← SRC1[511:480]
    TMP_DEST[511:480] ← SRC2[511:480]
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VUNPCKHPS (EVEX encoded version when SRC2 is memory)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF (EVEX.b = 1)
        THEN TMP_SRC2[i+31:i] ← SRC2[31:0]
        ELSE TMP_SRC2[i+31:i] ← SRC2[i+31:i]
    FI;
ENDFOR;
IF VL >= 128
    TMP_DEST[31:0] ← SRC1[95:64]
    TMP_DEST[63:32] ← TMP_SRC2[95:64]
    TMP_DEST[95:64] ← SRC1[127:96]
    TMP_DEST[127:96] ← TMP_SRC2[127:96]
FI;
IF VL >= 256
    TMP_DEST[159:128] ← SRC1[223:192]
    TMP_DEST[191:160] ← TMP_SRC2[223:192]
    TMP_DEST[223:192] ← SRC1[255:224]
    TMP_DEST[255:224] ← TMP_SRC2[255:224]
FI;
IF VL >= 512
    TMP_DEST[287:256] ← SRC1[351:320]
    TMP_DEST[319:288] ← TMP_SRC2[351:320]
    TMP_DEST[351:320] ← SRC1[383:352]
    TMP_DEST[383:352] ← TMP_SRC2[383:352]
    TMP_DEST[415:384] ← SRC1[479:448]
    TMP_DEST[447:416] ← TMP_SRC2[479:448]
    TMP_DEST[479:448] ← SRC1[511:480]
    TMP_DEST[511:480] ← TMP_SRC2[511:480]
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking* ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VUNPCKHPS (VEX.256 encoded version)
DEST[31:0] ← SRC1[95:64]
DEST[63:32] ← SRC2[95:64]
DEST[95:64] ← SRC1[127:96]
DEST[127:96] ← SRC2[127:96]
DEST[159:128] ← SRC1[223:192]
DEST[191:160] ← SRC2[223:192]
DEST[223:192] ← SRC1[255:224]
DEST[255:224] ← SRC2[255:224]
DEST[MAXVL-1:256] ← 0


VUNPCKHPS (VEX.128 encoded version)
DEST[31:0] ← SRC1[95:64]
DEST[63:32] ← SRC2[95:64]
DEST[95:64] ← SRC1[127:96]
DEST[127:96] ← SRC2[127:96]
DEST[MAXVL-1:128] ← 0


UNPCKHPS (128-bit Legacy SSE version)
DEST[31:0] ← SRC1[95:64]
DEST[63:32] ← SRC2[95:64]
DEST[95:64] ← SRC1[127:96]
DEST[127:96] ← SRC2[127:96]
DEST[MAXVL-1:128] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VUNPCKHPS __m512 _mm512_unpackhi_ps(__m512 a, __m512 b);
VUNPCKHPS __m512 _mm512_mask_unpackhi_ps(__m512 s, __mmask16 k, __m512 a, __m512 b);
VUNPCKHPS __m512 _mm512_maskz_unpackhi_ps(__mmask16 k, __m512 a, __m512 b);
VUNPCKHPS __m256 _mm256_unpackhi_ps(__m256 a, __m256 b);
VUNPCKHPS __m256 _mm256_mask_unpackhi_ps(__m256 s, __mmask8 k, __m256 a, __m256 b);
VUNPCKHPS __m256 _mm256_maskz_unpackhi_ps(__mmask8 k, __m256 a, __m256 b);
UNPCKHPS __m128 _mm_unpackhi_ps(__m128 a, __m128 b);
VUNPCKHPS __m128 _mm_mask_unpackhi_ps(__m128 s, __mmask8 k, __m128 a, __m128 b);
VUNPCKHPS __m128 _mm_maskz_unpackhi_ps(__mmask8 k, __m128 a, __m128 b);


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instructions, see Exceptions Type 4. EVEX-encoded instructions, see Exceptions Type E4NF.


UNPCKLPD—Unpack and Interleave Low Packed Double-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
66 0F 14 /r UNPCKLPD xmm1, xmm2/m128 | A | V/V | SSE2 | Unpacks and Interleaves double-precision floating-point values from low quadwords of xmm1 and xmm2/m128.
VEX.NDS.128.66.0F.WIG 14 /r VUNPCKLPD xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Unpacks and Interleaves double-precision floating-point values from low quadwords of xmm2 and xmm3/m128.
VEX.NDS.256.66.0F.WIG 14 /r VUNPCKLPD ymm1, ymm2, ymm3/m256 | B | V/V | AVX | Unpacks and Interleaves double-precision floating-point values from low quadwords of ymm2 and ymm3/m256.
EVEX.NDS.128.66.0F.W1 14 /r VUNPCKLPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | C | V/V | AVX512VL AVX512F | Unpacks and Interleaves double-precision floating-point values from low quadwords of xmm2 and xmm3/m128/m64bcst subject to write mask k1.
EVEX.NDS.256.66.0F.W1 14 /r VUNPCKLPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | C | V/V | AVX512VL AVX512F | Unpacks and Interleaves double-precision floating-point values from low quadwords of ymm2 and ymm3/m256/m64bcst subject to write mask k1.
EVEX.NDS.512.66.0F.W1 14 /r VUNPCKLPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst | C | V/V | AVX512F | Unpacks and Interleaves double-precision floating-point values from low quadwords of zmm2 and zmm3/m512/m64bcst subject to write mask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
C | Full | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description

Performs an interleaved unpack of the low double-precision floating-point values from the first source operand and the second source operand.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified. When unpacking from a memory operand, an implementation may fetch only the appropriate 64 bits; however, alignment to a 16-byte boundary and normal segment checking will still be enforced.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 64-bit memory location. The destination operand is a ZMM register, conditionally updated using writemask k1.

EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register, a 256-bit memory location, or a 256-bit vector broadcasted from a 64-bit memory location. The destination operand is a YMM register, conditionally updated using writemask k1.

EVEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register, a 128-bit memory location, or a 128-bit vector broadcasted from a 64-bit memory location. The destination operand is an XMM register, conditionally updated using writemask k1.



Operation

VUNPCKLPD (EVEX encoded versions when SRC2 is a register)

(KL, VL) = (2, 128), (4, 256), (8, 512)
IF VL >= 128
    TMP_DEST[63:0] ← SRC1[63:0]
    TMP_DEST[127:64] ← SRC2[63:0]
FI;
IF VL >= 256
    TMP_DEST[191:128] ← SRC1[191:128]
    TMP_DEST[255:192] ← SRC2[191:128]
FI;
IF VL >= 512
    TMP_DEST[319:256] ← SRC1[319:256]
    TMP_DEST[383:320] ← SRC2[319:256]
    TMP_DEST[447:384] ← SRC1[447:384]
    TMP_DEST[511:448] ← SRC2[447:384]
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking*    ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0



VUNPCKLPD (EVEX encoded version when SRC2 is memory)

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1
    i ← j * 64
    IF (EVEX.b = 1)
        THEN TMP_SRC2[i+63:i] ← SRC2[63:0]
        ELSE TMP_SRC2[i+63:i] ← SRC2[i+63:i]
    FI;
ENDFOR;
IF VL >= 128
    TMP_DEST[63:0] ← SRC1[63:0]
    TMP_DEST[127:64] ← TMP_SRC2[63:0]
FI;
IF VL >= 256
    TMP_DEST[191:128] ← SRC1[191:128]
    TMP_DEST[255:192] ← TMP_SRC2[191:128]
FI;
IF VL >= 512
    TMP_DEST[319:256] ← SRC1[319:256]
    TMP_DEST[383:320] ← TMP_SRC2[319:256]
    TMP_DEST[447:384] ← SRC1[447:384]
    TMP_DEST[511:448] ← TMP_SRC2[447:384]
FI;
FOR j ← 0 TO KL-1
    i ← j * 64
    IF k1[j] OR *no writemask*
        THEN DEST[i+63:i] ← TMP_DEST[i+63:i]
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+63:i] remains unchanged*
                ELSE *zeroing-masking*    ; zeroing-masking
                    DEST[i+63:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VUNPCKLPD (VEX.256 encoded version)

DEST[63:0] ← SRC1[63:0]
DEST[127:64] ← SRC2[63:0]
DEST[191:128] ← SRC1[191:128]
DEST[255:192] ← SRC2[191:128]
DEST[MAXVL-1:256] ← 0


VUNPCKLPD (VEX.128 encoded version)

DEST[63:0] ← SRC1[63:0]
DEST[127:64] ← SRC2[63:0]
DEST[MAXVL-1:128] ← 0


UNPCKLPD (128-bit Legacy SSE version)

DEST[63:0] ← SRC1[63:0]
DEST[127:64] ← SRC2[63:0]

DEST[MAXVL-1:128] (Unmodified)



Intel C/C++ Compiler Intrinsic Equivalent

VUNPCKLPD __m512d _mm512_unpacklo_pd(__m512d a, __m512d b);
VUNPCKLPD __m512d _mm512_mask_unpacklo_pd(__m512d s, __mmask8 k, __m512d a, __m512d b);
VUNPCKLPD __m512d _mm512_maskz_unpacklo_pd(__mmask8 k, __m512d a, __m512d b);
VUNPCKLPD __m256d _mm256_unpacklo_pd(__m256d a, __m256d b);
VUNPCKLPD __m256d _mm256_mask_unpacklo_pd(__m256d s, __mmask8 k, __m256d a, __m256d b);
VUNPCKLPD __m256d _mm256_maskz_unpacklo_pd(__mmask8 k, __m256d a, __m256d b);
UNPCKLPD __m128d _mm_unpacklo_pd(__m128d a, __m128d b);
VUNPCKLPD __m128d _mm_mask_unpacklo_pd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VUNPCKLPD __m128d _mm_maskz_unpacklo_pd(__mmask8 k, __m128d a, __m128d b);


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instructions, see Exceptions Type 4. EVEX-encoded instructions, see Exceptions Type E4NF.


UNPCKLPS—Unpack and Interleave Low Packed Single-Precision Floating-Point Values

Opcode/Instruction | Op/En | 64/32-bit Mode Support | CPUID Feature Flag | Description
NP 0F 14 /r UNPCKLPS xmm1, xmm2/m128 | A | V/V | SSE | Unpacks and Interleaves single-precision floating-point values from low quadwords of xmm1 and xmm2/m128.
VEX.NDS.128.0F.WIG 14 /r VUNPCKLPS xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Unpacks and Interleaves single-precision floating-point values from low quadwords of xmm2 and xmm3/m128.
VEX.NDS.256.0F.WIG 14 /r VUNPCKLPS ymm1, ymm2, ymm3/m256 | B | V/V | AVX | Unpacks and Interleaves single-precision floating-point values from low quadwords of ymm2 and ymm3/m256.
EVEX.NDS.128.0F.W0 14 /r VUNPCKLPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst | C | V/V | AVX512VL AVX512F | Unpacks and Interleaves single-precision floating-point values from low quadwords of xmm2 and xmm3/mem and write result to xmm1 subject to write mask k1.
EVEX.NDS.256.0F.W0 14 /r VUNPCKLPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst | C | V/V | AVX512VL AVX512F | Unpacks and Interleaves single-precision floating-point values from low quadwords of ymm2 and ymm3/mem and write result to ymm1 subject to write mask k1.
EVEX.NDS.512.0F.W0 14 /r VUNPCKLPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst | C | V/V | AVX512F | Unpacks and Interleaves single-precision floating-point values from low quadwords of zmm2 and zmm3/m512/m32bcst and write result to zmm1 subject to write mask k1.


Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
C | Full | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA


Description

Performs an interleaved unpack of the low single-precision floating-point values from the first source operand and the second source operand.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified. When unpacking from a memory operand, an implementation may fetch only the appropriate 64 bits; however, alignment to a 16-byte boundary and normal segment checking will still be enforced.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.


[Figure: SRC1 = X7 X6 X5 X4 X3 X2 X1 X0; SRC2 = Y7 Y6 Y5 Y4 Y3 Y2 Y1 Y0;
DEST = Y5 X5 Y4 X4 Y1 X1 Y0 X0]

Figure 4-28. VUNPCKLPS Operation



EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM register, conditionally updated using writemask k1.

EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register, a 256-bit memory location, or a 256-bit vector broadcasted from a 32-bit memory location. The destination operand is a YMM register, conditionally updated using writemask k1.

EVEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register, a 128-bit memory location, or a 128-bit vector broadcasted from a 32-bit memory location. The destination operand is an XMM register, conditionally updated using writemask k1.


Operation

VUNPCKLPS (EVEX encoded version when SRC2 is a ZMM register)

(KL, VL) = (4, 128), (8, 256), (16, 512)
IF VL >= 128
    TMP_DEST[31:0] ← SRC1[31:0]
    TMP_DEST[63:32] ← SRC2[31:0]
    TMP_DEST[95:64] ← SRC1[63:32]
    TMP_DEST[127:96] ← SRC2[63:32]
FI;
IF VL >= 256
    TMP_DEST[159:128] ← SRC1[159:128]
    TMP_DEST[191:160] ← SRC2[159:128]
    TMP_DEST[223:192] ← SRC1[191:160]
    TMP_DEST[255:224] ← SRC2[191:160]
FI;
IF VL >= 512
    TMP_DEST[287:256] ← SRC1[287:256]
    TMP_DEST[319:288] ← SRC2[287:256]
    TMP_DEST[351:320] ← SRC1[319:288]
    TMP_DEST[383:352] ← SRC2[319:288]
    TMP_DEST[415:384] ← SRC1[415:384]
    TMP_DEST[447:416] ← SRC2[415:384]
    TMP_DEST[479:448] ← SRC1[447:416]
    TMP_DEST[511:480] ← SRC2[447:416]
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking*    ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VUNPCKLPS (EVEX encoded version when SRC2 is memory)

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1
    i ← j * 32
    IF (EVEX.b = 1)
        THEN TMP_SRC2[i+31:i] ← SRC2[31:0]
        ELSE TMP_SRC2[i+31:i] ← SRC2[i+31:i]
    FI;
ENDFOR;
IF VL >= 128
    TMP_DEST[31:0] ← SRC1[31:0]
    TMP_DEST[63:32] ← TMP_SRC2[31:0]
    TMP_DEST[95:64] ← SRC1[63:32]
    TMP_DEST[127:96] ← TMP_SRC2[63:32]
FI;
IF VL >= 256
    TMP_DEST[159:128] ← SRC1[159:128]
    TMP_DEST[191:160] ← TMP_SRC2[159:128]
    TMP_DEST[223:192] ← SRC1[191:160]
    TMP_DEST[255:224] ← TMP_SRC2[191:160]
FI;
IF VL >= 512
    TMP_DEST[287:256] ← SRC1[287:256]
    TMP_DEST[319:288] ← TMP_SRC2[287:256]
    TMP_DEST[351:320] ← SRC1[319:288]
    TMP_DEST[383:352] ← TMP_SRC2[319:288]
    TMP_DEST[415:384] ← SRC1[415:384]
    TMP_DEST[447:416] ← TMP_SRC2[415:384]
    TMP_DEST[479:448] ← SRC1[447:416]
    TMP_DEST[511:480] ← TMP_SRC2[447:416]
FI;
FOR j ← 0 TO KL-1
    i ← j * 32
    IF k1[j] OR *no writemask*
        THEN DEST[i+31:i] ← TMP_DEST[i+31:i]
        ELSE
            IF *merging-masking*    ; merging-masking
                THEN *DEST[i+31:i] remains unchanged*
                ELSE *zeroing-masking*    ; zeroing-masking
                    DEST[i+31:i] ← 0
            FI
    FI;
ENDFOR
DEST[MAXVL-1:VL] ← 0


VUNPCKLPS (VEX.256 encoded version)

DEST[31:0] ← SRC1[31:0]
DEST[63:32] ← SRC2[31:0]
DEST[95:64] ← SRC1[63:32]
DEST[127:96] ← SRC2[63:32]
DEST[159:128] ← SRC1[159:128]
DEST[191:160] ← SRC2[159:128]
DEST[223:192] ← SRC1[191:160]
DEST[255:224] ← SRC2[191:160]
DEST[MAXVL-1:256] ← 0


VUNPCKLPS (VEX.128 encoded version)

DEST[31:0] ← SRC1[31:0]
DEST[63:32] ← SRC2[31:0]
DEST[95:64] ← SRC1[63:32]
DEST[127:96] ← SRC2[63:32]
DEST[MAXVL-1:128] ← 0


UNPCKLPS (128-bit Legacy SSE version)

DEST[31:0] ← SRC1[31:0]
DEST[63:32] ← SRC2[31:0]
DEST[95:64] ← SRC1[63:32]
DEST[127:96] ← SRC2[63:32]

DEST[MAXVL-1:128] (Unmodified)


Intel C/C++ Compiler Intrinsic Equivalent

VUNPCKLPS __m512 _mm512_unpacklo_ps(__m512 a, __m512 b);
VUNPCKLPS __m512 _mm512_mask_unpacklo_ps(__m512 s, __mmask16 k, __m512 a, __m512 b);
VUNPCKLPS __m512 _mm512_maskz_unpacklo_ps(__mmask16 k, __m512 a, __m512 b);
VUNPCKLPS __m256 _mm256_unpacklo_ps(__m256 a, __m256 b);
VUNPCKLPS __m256 _mm256_mask_unpacklo_ps(__m256 s, __mmask8 k, __m256 a, __m256 b);
VUNPCKLPS __m256 _mm256_maskz_unpacklo_ps(__mmask8 k, __m256 a, __m256 b);
UNPCKLPS __m128 _mm_unpacklo_ps(__m128 a, __m128 b);
VUNPCKLPS __m128 _mm_mask_unpacklo_ps(__m128 s, __mmask8 k, __m128 a, __m128 b);
VUNPCKLPS __m128 _mm_maskz_unpacklo_ps(__mmask8 k, __m128 a, __m128 b);


SIMD Floating-Point Exceptions

None


Other Exceptions

Non-EVEX-encoded instructions, see Exceptions Type 4. EVEX-encoded instructions, see Exceptions Type E4NF.